As different as it is similar
In my Next Frontiers in Entertainment Data Science article on Towards Data Science, I refer to how data science can be applied at various phases of the content lifecycle, from greenlight to production to release. Though it’s easy to conceptualize how applications of data science might differ between, say, deciding what scripts should be greenlit and determining how production costs can be optimized, there can be stark differences even across contexts that, at first sight, might seem relatively similar.
About a year and a half ago, I started a new job at a major movie studio. Coming from the streaming tech side of the business, I expected things to be more or less similar, except that this time I’d be working with movie data exclusively rather than both TV and movie data. It’s all predicting how popular things are gonna be using data, so how crazy different could things be?
Boy, did I have no clue.
The business is totally different. The questions are different, the stakeholders are different, the data is different, etc. So I wanted to write this piece with two goals in mind. The first, more obvious goal is to show aspiring and junior entertainment data professionals how data science work can differ between theatrical and streaming contexts. But I imagine this kind of dynamic can manifest in a lot of different industries, where you think you’ll be doing largely the same predict-Y-using-X thing you always did only to find out that both X and Y look entirely different. So the second, broader goal is to give data professionals in all fields an idea of how two jobs that seem remarkably similar on the surface can be totally different once you really start digging into the data and the business questions at hand.
With that, below are some of my key observations after making the leap from streaming entertainment data science to theatrical entertainment data science. I skip over some of the more blatantly “no duh” points (oh, there’s no theatrical TV show releases, what a surprise), but I touch on some of the major trends. And of course, none of this is some biblical statement of truth; YMMV based on company, team leadership, and the like. Furthermore, although data science can play a role in earlier phases of the entertainment content lifecycle as I allude to above, this piece derives from my experience with more downstream processes nearer to release. If I ever seem a bit ambiguous, that’s deliberately because I don’t wanna spill any of the secret sauce 😉
Scope of Data
The most immediately obvious difference is the sheer scope of data. In theatrical data science, the primary unit of analysis is the movie, or perhaps the movie-country, and there are only so many movies that come out in a given country in a given year!
This isn’t to say you never work with larger datasets that come in at more granular levels on the theatrical side; these datasets are generally tied to the title or some element of the title and you often work with them and process them in some way to generate title-relevant insights. But the bottom line is that because the title space is smaller by default, the scope of data is smaller. I do hope that one day, we can get more of the granular, individual-level insight into consumption in the theatrical domain that’s possible in the streaming domain, but at least as it is now, that’s not the case.
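To make the unit-of-analysis point concrete, here’s a minimal sketch of rolling a more granular dataset up to the movie-country level. The row layout and the `rollup_to_title_country` name are my own assumptions for illustration, not anything from a real pipeline, and the values are made up.

```python
from collections import defaultdict


def rollup_to_title_country(rows):
    """Sum granular (title, country, amount) rows to one value per movie-country.

    The (title, country, amount) layout is a hypothetical example format.
    """
    totals = defaultdict(float)
    for title, country, amount in rows:
        totals[(title, country)] += amount
    return dict(totals)


# Made-up granular rows, e.g. one row per reporting period.
granular_rows = [
    ("Movie A", "US", 1.0),
    ("Movie A", "US", 2.5),
    ("Movie A", "FR", 4.0),
    ("Movie B", "US", 3.0),
]

title_level = rollup_to_title_country(granular_rows)
```

However many granular rows go in, what comes out has only as many entries as there are movie-country pairs, which is why the modeling dataset stays small.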
No* Historical Data
In streaming, the vast majority of the time (with the obvious exception of streaming exclusive releases), you’re going to have some significant amount of historical data to work with. How much did the title make at the box office? How was the social media buzz around it when it came out? How did the title do on Rotten Tomatoes?
You do not have this luxury in the theatrical space. Sure, you can to some extent lean on the history of particular components, whether they be cast, crew, genre, or some combination, but even then those data points generally will not be as clearly linkable to a particular title as literal historical data. Plus, such comparisons can be riddled with subjectivity concerns and exogenous confounding factors; how is it decided what titles are truly comparable to others? What role do marketing and differences in marketing campaigns play in the public’s perception of similarity between titles?
*Yes, series and franchises are a half-exception to this rule, but over-reliance on sequelitis and similarity presumptions can easily backfire. Yes, in many cases, the performance of a predecessor can be a decent eyeball estimate of its successor, but franchises can lose steam over time or be overextended (new characters/plots and weak linkages to past titles) beyond recognition such that earlier title performance can be meaningless in predicting the performance of newer titles.
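For a feel of what leaning on component history can look like, here’s a rough sketch that scores past titles by how many components (cast, crew, genre tags) they share with a new title, then takes a similarity-weighted average of their performance. To be clear, this is a toy illustration under my own assumptions, not how any studio actually does it: the `Title` class, the Jaccard similarity choice, the `comp_estimate` helper, and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Title:
    name: str
    components: frozenset  # hypothetical tags such as "genre:horror"
    opening: float         # made-up performance figure, arbitrary units


def jaccard(a, b):
    """Shared components divided by all components across both titles."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def comp_estimate(new_components, history, k=2):
    """Similarity-weighted average performance of the k most similar titles.

    Returns None when no past title shares any component with the new one.
    """
    scored = sorted(
        ((jaccard(new_components, t.components), t) for t in history),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total_weight = sum(score for score, _ in scored)
    if total_weight == 0:
        return None
    return sum(score * t.opening for score, t in scored) / total_weight


history = [
    Title("Past A", frozenset({"genre:horror", "director:A", "actor:X"}), 30.0),
    Title("Past B", frozenset({"genre:horror", "actor:Y"}), 10.0),
    Title("Past C", frozenset({"genre:comedy", "actor:Z"}), 50.0),
]
new_title = frozenset({"genre:horror", "director:A"})
estimate = comp_estimate(new_title, history)
```

Notice how much subjectivity is baked in: which tags count, how they’re weighted, and how many comps to use are all judgment calls, and nothing here accounts for marketing or for a franchise losing steam.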
Very Particular Data
Coming from startup streaming tech territory with a data team run by tech people, I spent a lot of time doing research on what datasets might be useful for our needs. Over the course of such work, I found lots of obscure datasets fitting various needs and investigated how we might be able to cheaply collect data that vendors sold for a high price (e.g. how can we get Google search data without paying for an expensive license?).
On the theatrical side, the standards and conventions seem far more established. There are certain consumer and social media datasets or dataset types that more or less everyone in the industry uses. For example, while social listening might come to mind as an obvious contemporary data source, there are major established vendors that provide detailed pre-release and post-release consumer data for titles, and some of these vendors have been around for decades. These are the kind of datasets that many outside the immediate theatrical space might never have heard of, but when you’re in the space, they’re all you ever talk about.
No Windows (or, a Single Window)
In the streaming space, the window of availability (and to some extent, the nature of availability) is a huge factor in analyses. Such windows can interact with a variety of content-level factors (e.g. is the title about Christmas and is the window a Christmas window?) and marketplace-level factors (e.g. is the title being displayed prominently on the front page?).
As you can imagine, these concerns are less present in theatrical data science — or, more accurately, unless you’re doing upstream modeling related to “should we make this title?” or “when should we release this title?” any concerns about windowing factors have already been decided for you in the form of a (likely) release date by the time you get involved. There’s only one window to worry about (unless you have to worry about staggered release dates, and that’s a whole ‘nother ballgame), and the Powers That Be have already decided when it’ll be. Now you need to do your best to provide all the useful insights you can in the context of that window.
Greater Emphasis on the Business
During my time on the streaming side, it was easy to treat titles and audiences like numbers because we had data on hundreds of thousands of them, and that philosophy was reflected in the methodology too. It was common not only to generate summary statistics, but also to turn everything into some kind of vector embedding (i.e. a series of numbers that represents something across some human-unobservable set of dimensions), even if that came at the cost of interpretability; it doesn’t really mean anything to be able to say, “Content dimension 2 is the most important variable in the model”.
On the theatrical side, there’s a greater focus on the business, beyond the numbers. The data isn’t there just for the numbers’ sake, but for the actionable insights it can provide to various stakeholders around the organization, many of whom are not data scientists and may not even work with data on a day-to-day basis. Making accurate predictions is important, but just as important is interpretability, and there’s no rush to throw interpretability out the window just for the sake of reducing model error by a small fraction of a percent. In turn, I feel more connected to both the business and audiences in the work I do.
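Here’s a small sketch of why named features are easier to talk about than embedding dimensions. It breaks a linear model’s prediction into per-feature contributions that a non-technical stakeholder can read. The feature names, coefficients, and the `explain_prediction` helper are all hypothetical and made up for illustration; they don’t come from any real model.

```python
def explain_prediction(intercept, coefficients, features):
    """Split a linear prediction into named per-feature contributions.

    Returns the prediction and the contributions ranked by absolute size.
    Features missing from `features` are treated as 0.
    """
    contributions = {
        name: coef * features.get(name, 0.0)
        for name, coef in coefficients.items()
    }
    prediction = intercept + sum(contributions.values())
    ranked = sorted(
        contributions.items(), key=lambda item: abs(item[1]), reverse=True
    )
    return prediction, ranked


# Hypothetical coefficients for a toy model (arbitrary units).
coefficients = {"is_sequel": 8.0, "holiday_release": 3.0, "runtime_hours": -1.0}
features = {"is_sequel": 1.0, "holiday_release": 0.0, "runtime_hours": 2.0}

prediction, ranked = explain_prediction(5.0, coefficients, features)
for name, contribution in ranked:
    print(f"{name}: {contribution:+.1f}")
```

A line like “being a sequel adds the most to this estimate” is something a marketing lead can act on. Swap the keys for `content_dim_2` and the same arithmetic tells them nothing.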
Key Takeaways and Conclusion
I’ve touched on various topics above in the context of my streaming-to-theatrical jump within the entertainment industry, but the underlying themes raise questions relevant to anyone hopping between two loosely similar jobs in the same industry. So before you go assuming that your next job will be largely more of the same, here are some questions, based on the differences between streaming and theatrical data science I mention above, worth pondering as you compare your last job and your next one:
- Scope of data: What is the unit of data? How often is the data added to and with how many units each time? As a result, how big is the dataset, and what tools are needed to handle such a dataset?
- Availability of historical data: What kind of historical data is available, if any at all? Is available historical data a direct fit or does it involve some kind of aggregation, imputation, or similarity analysis?
- Data sources: What data sources are used? Are the data sources used more generally relevant or are they very context-specific? How much room is there to experiment with new data sources or to put aside existing data sources? What are the established, conventional datasets that everyone uses?
- Time elements: What is the time window of relevance for the particular question you must answer? How is it decided? Is it singular or multiple, fixed or shifting? How do you need to account for time and associated factors (e.g. seasonality, holidays, etc.) in the work? Is a particular window of time of greater interest to the business than another?
- Business emphasis: Who is the audience? Given this, what is the balance between accuracy and interpretability that needs to be struck? And how does that in turn affect what kind of features you find useful? How does the pace of the business push the pace of the work?
Clearly, I was hired for my current position because my skill set is relevant to the job duties and what I do is similar to what I did before — but data science on the theatrical side compared to the streaming side is somehow as different as it is similar. As I elaborate above, the data is different, the processes are different, and the expectations are different. I hope you found this article useful if you’re hoping to enter the exciting field of entertainment data science or pondering a switch to a similar but different job in whatever industry you’re in!
At time of writing, Danny Kim (PhD, University of Pennsylvania; Forbes 30 Under 30 2022) is Senior Data Scientist on the Marketing Analytics & Insights team of Sony Pictures Entertainment Motion Picture Group. Danny previously worked at Whip Media and Paramount Pictures, and he is an alumnus of the Annenberg Schools for Communication at Penn and USC; The Wharton School; and the USC School of Cinematic Arts.
Entertainment Data Science: Streaming vs. Theatrical was originally published in Towards Data Science on Medium.