Identifying Returning Moviegoers With Machine Learning
At Movio, our core mission is to connect moviegoers to their ideal movie, so everyone can experience the magic of cinema. Because of this, the Data Science team is constantly looking for ways to improve our understanding of moviegoers and their taste. In this post we take an in-depth look at one of our latest projects, where we set out to find out whether it is possible to predict who is more likely to return to the cinema as the ongoing COVID-19 situation eases. So without further ado, here we go:
0. TL;DR
- We used XGBoost and loyalty data from four US-based exhibitors and one UAE-based exhibitor to predict who is more likely to return to the cinema after COVID-19.
- Test ROC AUCs were in the 0.75–0.82 range.
- Number of visits in the previous year, spend per visit, and engagement with the loyalty program were the main drivers behind the (predicted) likelihood of visitation.
- These insights and our ability to predict visitation generalize across exhibitors, territories, and movie genres, in our sample.
1. Motivation
COVID-19 has had a tremendous impact on the movie exhibition business. Theatres around the world were closed for the best part of last year, and continue to be affected by ongoing restrictions. We want to support our clients, so we decided to try to understand attendance patterns throughout the pandemic, find out which factors were predictive of visitation, and build a segmentation tool able to identify those most likely to return to the cinema once it is safe to do so.
2. Hypothesis
Our hypothesis was simple: visit patterns before the pandemic, specifically the level of engagement with the loyalty program and details of the cinema experience, may predict behaviour through the pandemic. Additionally, we suspected demographics might play a role, with some segments being more likely to visit first.
3. Tech Stack
Throughout our experiments we used some cool toys:
- Spark: at Movio we use Spark extensively for our big data needs. Consequently, all of the data-wrangling for this project was done using PySpark’s DataFrame API.
- Pandas: Pandas is the lifeblood of Data Science. We used it here as the bridge between PySpark and the rest of our experimental setup.
- Scikit-Learn: mainly to conveniently split and shuffle our dataset.
- XGBoost: we used this popular implementation of Gradient Boosting Machines (GBM) for its speed and good out-of-the-box performance. We are aware of the native GBM implementation that PySpark offers, but we ultimately decided on XGBoost for the aforementioned reasons.
- SHAP: very cool library for Machine Learning interpretability that leverages the Shapley Value concept from Game Theory. More about it later.
4. Data
We used data from four exhibitors in the United States (US) and one in the Middle East (UAE). For each moviegoer who had come to the movies in the relatively recent past, we put together their pre-pandemic state: number of visits in the prior year (counted from their last visit date), spend per visit, number of emails opened, number of clicked links, and whether or not a valid mobile number had been provided. These features were concatenated with moviegoer ethnicity (only available for the US-based exhibitors), moviegoer generation (Baby Boomer, X, Millennial, Z), and moviegoer gender. For each of these categorical features, an "unknown" column was added to model any missing data. We also determined which of these moviegoers had actually come to the movies during the periods when cinemas were open after June 2020; this was our ground truth. Data was shuffled and split 70%-30% between training and validation sets.
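In production this wrangling runs on PySpark's DataFrame API; the sketch below shows an equivalent feature assembly in plain pandas for readability. Column and function names here are illustrative, not the actual pipeline's.

```python
import pandas as pd

def build_features(visits: pd.DataFrame, members: pd.DataFrame) -> pd.DataFrame:
    """Assemble one pre-pandemic feature row per moviegoer.

    `visits` columns: member_id, visit_date, spend;
    `members` columns: member_id plus the categorical attributes.
    (All names are illustrative.)
    """
    visits = visits.assign(visit_date=pd.to_datetime(visits["visit_date"]))

    # Visits in the year prior to each member's last visit, and spend per visit.
    last = visits.groupby("member_id")["visit_date"].max().rename("last_visit").reset_index()
    v = visits.merge(last, on="member_id")
    prior = v[v["visit_date"] > v["last_visit"] - pd.Timedelta(days=365)]
    agg = prior.groupby("member_id").agg(
        visits_last_year=("visit_date", "count"),
        spend_per_visit=("spend", "mean"),
    ).reset_index()

    df = members.merge(agg, on="member_id", how="left")

    # One-hot encode categoricals, with an explicit "unknown" column
    # standing in for missing data.
    for col in ["gender", "generation"]:
        df[col] = df[col].fillna("unknown")
        df = pd.concat([df, pd.get_dummies(df[col], prefix=col)], axis=1).drop(columns=col)
    return df
```

The resulting frame can then be shuffled and split 70/30, e.g. with scikit-learn's train_test_split.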
5. Model
We trained five Gradient Boosting models (XGBoost, one per exhibitor) on the task of predicting the probability of moviegoer visitation given their pre-pandemic state and their demographics. We used the same hyperparameters for every model:
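The exact values are not reproduced here; the sketch below uses illustrative placeholders, with only the positive-class weighting rule taken from our actual setup (XGBoost exposes it as scale_pos_weight):

```python
import numpy as np

def xgb_params(y_train: np.ndarray) -> dict:
    """Hyperparameters shared by all five models.

    The numeric values below are illustrative placeholders; only the
    positive-class weighting rule comes from our actual setup.
    """
    n_neg = int((y_train == 0).sum())
    n_pos = int((y_train == 1).sum())
    return {
        "n_estimators": 200,      # illustrative
        "max_depth": 6,           # illustrative
        "learning_rate": 0.1,     # illustrative
        # Negative-to-positive ratio, to counter the class imbalance.
        "scale_pos_weight": n_neg / n_pos,
    }
```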
where pos_weight was computed as the ratio of negative to positive examples. This last hyperparameter proved very helpful in managing the imbalance of the dataset, which reflects the extreme impact of COVID-19 on the industry: in our sample, people who visited the cinema after June 2020 represent less than 9% of the dataset.
6. Results
We used several metrics to assess the performance of our models, namely:
- Precision: the fraction of moviegoers we predict will come to the cinema, that actually come to the cinema.
- Recall: the fraction of moviegoers that will come to the cinema that the model is able to identify.
- F1 measure: a composite metric that incorporates precision (P) and recall (R) to give a better overall picture of performance. It's defined as the harmonic mean of precision and recall: F1 = 2·P·R / (P + R).
- ROC AUC: stands for Area Under the Receiver Operating Characteristic curve. This metric is best understood as the probability that our model assigns a higher visit probability to a person who will come to the cinema than to a person who won't.
- Precision Lift (%): a comparison, in terms of precision (see above), between our model and a naïve rule asserting that everybody will come to the movies. Measuring this is important because a large AUC can hide a model that does no better than such a simple rule, and including a baseline is sensible in any case, since simple heuristics sometimes perform comparably to more complex machine learning solutions.
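Given model scores and ground-truth labels, these metrics (and the precision lift over the everybody-returns baseline) can be computed with scikit-learn along these lines; the function and dictionary keys are ours, not from the original pipeline:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the metrics above for one exhibitor's test set."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    precision = precision_score(y_true, y_pred)
    # Naive baseline: predict that everybody returns.  Its precision is
    # simply the positive rate of the dataset.
    baseline = y_true.mean()
    return {
        "precision": precision,
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),
        "precision_lift_pct": 100.0 * (precision - baseline) / baseline,
    }
```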
We observed pretty consistent results across the board, with ROC AUCs in the 0.75–0.82 range and no signs of overfitting. Using moviegoer data going all the way back to 2012 (and a 0.5 decision threshold to evaluate precision and recall), we achieved the following results:
It's worth noting that as part of our original hypothesis we set out to use only data from moviegoers who had visited in 2019. Later on we decided to use moviegoer data from earlier years as well, leading to further improvements in model AUC and recall@0.5, with a very slight decrease in precision. Interestingly enough, looking further back into the past gave us many more negative examples (e.g. moviegoers who had already stopped visiting the cinema would also not visit after June 2020) and only a modest increase in positive examples. We hypothesize that this abundance of negative examples helps the model "understand" what a non-returning moviegoer looks like.
7. Feature importance analysis
We were interested in understanding the factors that were predictive of visitation, to inform our clients' marketing strategies. For this we relied on the concept of Shapley Values and the cool SHAP Python library. The Shapley Value is a concept from game theory that formalizes the individual contribution of a player in a coalition to the attainment of a reward in a game. Shapley Values are the expectation of that contribution over the set of all possible permutations and values of the player coalition, taking into consideration all possible interactions between players. Formally, for a coalitional form game〈N,v〉with a finite set of players N of size n and a function v:2^N→R that describes the total worth of a coalition, the marginal importance of player i can be expressed as

Sh_i(v) = Σ_{S⊆N∖{i}} [|S|!(n−|S|−1)!/n!]·(v(S∪{i})−v(S))

where Sh_i is the individual contribution of player i to the total coalition worth v(N), i.e. its Shapley Value. The summation is taken over all possible subsets S⊆N that don't include player i, and each of its terms captures the effect of player i on the reward attained by each subset, v(S∪{i})−v(S).
Shapley Values can be used to formalize the concept of feature importance in Machine Learning and extend it to highly complex models, by treating model predictions as the worth of a coalition and decomposing the predicted value into a summation of predictor (player) contributions. In the case of a linear model with binary inputs, the Shapley Value of a predictor equals its coefficient.
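To make the definition concrete, here is a toy brute-force implementation that enumerates every subset (exponential cost, so only for tiny games); it also illustrates the linear-model claim above:

```python
from itertools import combinations
from math import factorial

def shapley_values(v, n):
    """Exact Shapley values for a coalitional game <N, v>, N = {0, ..., n-1}.

    v maps a frozenset of players to the coalition's worth.  Runs in
    O(2^n), so this is for illustration only.
    """
    values = []
    for i in range(n):
        others = [p for p in range(n) if p != i]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                S = frozenset(subset)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (v(S | {i}) - v(S))
        values.append(total)
    return values

# A linear "model" over binary inputs, evaluated at x = (1, 1): a present
# player (feature) contributes its coefficient to the prediction.
coef = [2.0, 5.0]
v = lambda S: sum(coef[i] for i in S)
print(shapley_values(v, 2))  # -> [2.0, 5.0]: each feature gets its coefficient
```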
The SHAP library makes it straightforward to compute Shapley Values, and also provides nice visualizations out of the box. Analyzing our models, we found some interesting bits:
- The number of sessions in the previous year was the most important predictor in four of the five exhibitors we analyzed, while spend per visit was always in the top three.
- For US exhibitor 4, the most important predictor is whether the moviegoer has registered a valid mobile phone number. If they have a valid mobile phone number registered, a moviegoer is more likely to visit the cinema.
- For US exhibitors 1, 2, 3, and the UAE exhibitor, if the gender of the moviegoer is not known, the probability of visiting the cinema goes down.
- In US exhibitor 4's case, engagement with email campaigns (opening the emails, clicking the links) positively impacts the likelihood of visiting. For the other exhibitors the results are less clear-cut. These differences are not surprising, since each exhibitor handles email campaigns differently (i.e. email templates, frequency, etc.).
- For US exhibitor 4, the feature movie_generation_unknown is negatively correlated with the likelihood of visitation. This feature is derived from the moviegoer's age, which is routinely collected as part of loyalty programs.
- For the UAE exhibitor, the picture is pretty stable, with the probability of visitation mostly explained by the same features as its American counterparts, i.e. number of visits, spend, having valid gender and age information registered, and engagement with marketing campaigns.
- Each exhibitor seems to have a different age and ethnicity audience profile, which leads some of the related features to be more or less important in each case.
The takeaway seems to be that, beyond the frequency with which people come to the cinema and the amount of money they spend as part of the movie-going experience, the level of engagement with the loyalty program (i.e., having submitted valid personal information) and, in some cases, the level of interaction with marketing campaigns are the main drivers behind the likelihood of visiting throughout the COVID-19 pandemic, according to our models. These insights also hold for the Middle East exhibitor, which is good evidence for the generalizability of our findings.
7.1. Analysis of recent movies
We decided to examine more closely what's going on around some recently released titles, namely Mortal Kombat (Fantasy, Action, Adventure), Godzilla vs. Kong (Science Fiction, Action, Drama), and Tom & Jerry (Comedy, Family, Animation). These titles represent a large share of the box office in 2021 so far, as part of the recovery of the exhibition business in the post-COVID era. Using data from US Exhibitor 1, we restricted our feature attribution analysis to these movies and found a familiar picture, where attendance likelihood is driven quite strongly by whether moviegoers have submitted valid personal data as part of the loyalty program (mobile number, age), frequency of visitation in the last year, spend, and engagement with marketing campaigns (clicking email links). These insights are consistent with the results we discussed before, and suggest the trends generalize across movie genres:
Another interesting insight suggested by our results is that Baby Boomers are less likely to visit compared to other demographics (Millennials and Zoomers). This could be due to them being extra careful, given their higher risk of experiencing adverse outcomes associated with COVID-19.
8. Inference w/ PySpark
Once we had a model with a performance level we felt comfortable with, we set out to generate predictions for all moviegoers in the loyalty programs of each exhibitor. These predictions will be used to compute segments for marketing campaigns.
However, some of our exhibitors are quite large (for one specific exhibitor, over 7M different moviegoers visited during 2019), so just loading everything into memory won't cut it. For this reason we used PySpark Pandas UDFs to execute the inference task. Pandas UDFs (also known as vectorized UDFs) are an efficient way to execute Python code on top of Spark's distributed infrastructure.
As seen next, our vectorized UDF inference_func receives an Iterator of Pandas DataFrames and returns another Iterator of Pandas DataFrames. Inside the UDF the model gets called and predictions are computed, and all of this happens in the executor nodes. Using this pattern, we can scale up our inference task transparently.
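The snippet below sketches the pattern; the model object, column names and schema are illustrative, and the Spark-side call is shown as a comment so the sketch stays runnable without a cluster:

```python
from typing import Iterator

import pandas as pd

def make_inference_func(model, feature_cols):
    """Build inference_func for use with Spark's mapInPandas.

    Spark feeds each partition to inference_func as an iterator of pandas
    DataFrames, so the full dataset never has to fit in memory; `model`
    and `feature_cols` are captured in the closure and shipped to the
    executors.
    """
    def inference_func(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        for batch in batches:
            # Probability of the positive class (returning to the cinema).
            scores = model.predict_proba(batch[feature_cols])[:, 1]
            yield pd.DataFrame({
                "member_id": batch["member_id"].to_numpy(),
                "visit_probability": scores,
            })
    return inference_func

# On the Spark side (illustrative):
# predictions = features_df.mapInPandas(
#     make_inference_func(model, FEATURE_COLS),
#     schema="member_id long, visit_probability double",
# )
```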
9. Cross-exhibitor and cross-territory prediction
Finally, we wanted to understand the generalization properties of the prediction models we had created. To that end, we assessed the performance of a model trained on US exhibitor 1 data using US exhibitor 4 loyalty data. We found that in this particular case the results closely tracked the performance of the model on the US exhibitor 1 test set in terms of ROC AUC, with better f1/recall/precision @0.5:
Given that the results suggest our models have good generalization abilities, and that the same features explain the likelihood of visitation across exhibitors in two different countries, we decided to evaluate a model trained on US exhibitor 1 data using the UAE exhibitor data. The results were very consistent with what we had already seen, in terms of both ROC AUC and precision/recall, with a slight increase in performance (ROC AUC) that we hypothesize is due to the larger size of the US exhibitor's dataset:
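Mechanically, a cross-exhibitor check just scores a model trained on one exhibitor against another exhibitor's held-out data, assuming both feature sets were built with the same schema. A minimal sketch (function and key names are ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cross_evaluate(model, datasets):
    """Score one model on several exhibitors' test sets.

    datasets maps an exhibitor name to an (X_test, y_test) pair whose
    features follow the training schema.  Returns ROC AUC per exhibitor.
    """
    return {
        name: roc_auc_score(y, model.predict_proba(X)[:, 1])
        for name, (X, y) in datasets.items()
    }
```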
10. Conclusion
This was an interesting project altogether, and a nice example of how Machine Learning can be used to do precision marketing. Using XGBoost, PySpark and Movio data, we were able to build a segmentation tool to be used to help our customers in the exhibition industry to identify those who are more likely to visit the cinema as soon as it’s safe to do so. By analyzing feature importance via Shapley Values we were able to uncover interesting insights that will inform the marketing strategies of the exhibitors we work with around the world.
For more on the results above visit our Marketing overview here.