01-24-2023 03:06 AM - edited 01-25-2023 08:03 AM
Unwrapping Xmas Number One Singles through Data
by Jack Mason, Jack McCormack, Loic Poulmarc'k, Markus Bergmaier, Sara Seylani, and Niamh O’Brien
The number 1 single on the week of Christmas (known colloquially as just “The Christmas Number 1”) is a festive tradition of growing magnitude in the UK and Ireland, and it remains a coveted and hotly contested prize. Much like how families debate which ingredients make up the perfect Christmas dinner, what makes a Christmas Number 1 is fiercely debated every year. We had several hypotheses as a team about the most influential factors, but instead of debating them out, we decided to analyze the data behind Christmas Number 1s as this year’s team hackathon.
We started this data project as any data scientist ought to, by first deciding on a number of questions that we wanted to answer:
Then it was time to design an architecture that would allow us to get to the bottom of these questions quickly, while also being reliable enough that we could build on it next year if we so wished.
If you’re the sort of person who likes to read the last page of a book first (!?!), here’s a link to the results section.
High Level Architecture Diagram for the project
We began ingesting data from the following data sources:
Spotify
Google Trends
Wikipedia
Record Artwork
Fivetran was the ideal choice as the data ingestion solution for many reasons; crucial for this project were its support for connecting custom data sources via a myriad of methods, its normalized schemas, and its integrations with dbt Core and Terraform.
This facilitated:
Only the song characteristics for Christmas Number 1s were required for this project, therefore this function retrieves the list of Christmas Number 1s from the S3 bucket containing the Wikipedia data and parses the data to extract chart information.
For each chart data entry, the function searches Spotify for track information, collecting the results in a list and the unique track IDs in a set. With those track IDs, it then queries the Spotify API for each track’s audio features.
Finally, it assembles all of the above, along with other meta information, into a dictionary, which is returned as a JSON string.
This explanation of the function was provided by ChatGPT!
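Since the Lambda code itself isn’t shown, here is a minimal sketch of that flow; the function and field names are our own illustrative choices, and the Spotify Web API calls are stubbed out as the `search_fn` parameter so the aggregation logic stands on its own:

```python
import json

# Hypothetical reconstruction of the Spotify Lambda's data flow
# (not the team's actual code).

def collect_tracks(chart_entries, search_fn):
    """Search Spotify for each chart entry; gather track info and unique IDs."""
    tracks, track_ids = [], set()
    for entry in chart_entries:
        track = search_fn(entry["title"], entry["artist"])  # e.g. GET /v1/search
        if track:
            tracks.append(track)
            track_ids.add(track["id"])  # the set de-duplicates repeat Number 1s
    return tracks, track_ids

def build_payload(tracks, audio_features):
    """Combine track info, per-ID audio features, and meta info into JSON."""
    payload = {
        "tracks": tracks,
        "audio_features": audio_features,  # e.g. GET /v1/audio-features?ids=...
        "meta": {"source": "spotify", "record_count": len(tracks)},
    }
    return json.dumps(payload)
```

In the real function, `search_fn` and the audio-features lookup would be authenticated calls against the Spotify Web API.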
To analyze the cover artwork of a song in a data-driven way, we thought it would be most interesting to identify the objects present in the artwork.
For object recognition, we used AWS Rekognition, which is a computer vision platform that can detect and label objects within images. These labels were integrated into Databricks as a table, ready for analysis, using a Fivetran custom function connector.
The approach is relatively simple:
To save cost, it makes sense to include logic that processes each image exactly once. A simple way to do this is to introduce a queuing service such as AWS SQS: images that need processing are written to the queue, and the Lambda function consumes them from there. After processing an image, the Lambda function deletes its message from the queue.
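A sketch of that queue-driven labeling step might look like the following; this is our reconstruction rather than the team’s code, and the clients are passed in as parameters so the flow is visible without AWS credentials (in the real Lambda they would be boto3 clients for Rekognition and SQS):

```python
def label_images(records, rekognition, sqs, bucket, queue_url, min_confidence=80.0):
    """Label each queued artwork image with Rekognition, then delete its message."""
    results = []
    for record in records:  # one SQS message per image key
        key = record["body"]
        response = rekognition.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MaxLabels=10,
            MinConfidence=min_confidence,
        )
        results.append({
            "image": key,
            "labels": [label["Name"] for label in response["Labels"]],
        })
        # Delete the message so each image is processed exactly once
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=record["receiptHandle"])
    return results
```

The `detect_labels` call shape matches the Rekognition API; bucket and queue names would come from the deployment configuration.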
The S3 connector was chosen to ingest Wikipedia data on historical Christmas Number 1s because of its ease of setup and its seamless integration with the Spotify AWS Lambda function. If you are thinking that we just downloaded the relevant Wikipedia page and uploaded it to S3, you would be correct; we chose pragmatism over over-engineering in this instance.
We opted for local execution of a Python script running Pytrends to extract data from Google Trends, as Pytrends was performing unreliably due to Google’s rate limiting and anti-scraping detection methods, which were more easily skirted when executing locally.
The Fivetran Google Sheets connector then allowed us to quickly and easily integrate the output of this local script into Databricks.
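The local pull described above can be sketched roughly as follows; the keyword batching, timeframe, and pacing here are illustrative assumptions rather than the team’s exact settings:

```python
import time

# A minimal sketch of a local Pytrends pull with generous pacing between
# requests to stay under Google's rate limits.

def chunked(items, size):
    """Split a list into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def fetch_trends(keywords, timeframe="2004-01-01 2022-12-31", pause=60):
    """Pull interest-over-time data for each batch of up to 5 search terms."""
    from pytrends.request import TrendReq  # pip install pytrends
    pytrends = TrendReq(hl="en-GB", tz=0)
    frames = []
    for batch in chunked(keywords, 5):  # Trends compares at most 5 terms at once
        pytrends.build_payload(batch, timeframe=timeframe)
        frames.append(pytrends.interest_over_time())
        time.sleep(pause)  # back off between requests to avoid refusals
    return frames
```

The resulting frames would then be written out for the Google Sheets connector to pick up.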
The data from all the different sources used in this project were ingested through Fivetran and integrated into Databricks as relational tables.
Databricks was chosen as a destination because we wanted to do conventional analytics (BI) as well as advanced analytics (AI); the Lakehouse concept of Databricks enabled us for both use cases.
The key requirements in choosing a transformation tool/setup were simplicity and time to live. Fivetran’s Transformations for dbt Core was the tool of choice for this part of the project for several reasons:
dbt Core is also an open-source project, so should we wish to extend this project next year, we could leverage some of Fivetran’s pre-built dbt models.
Interactive Data Lineage Graph of the modeling layer within the Fivetran UI
To best understand the trends in Christmas Number 1s and perform holistic analytics, we combined all of the data sources to form a final analytics-ready model.
Following dbt best practices, we had a staging layer for data reformatting, an intermediate layer for table join operations and a Mart layer which was exposed to Tableau and AutoML.
Custom calculations and use-case-specific definitions were handled in the visualization layer, since we wanted the modeling layer to manage only the computationally expensive operations and produce a dataset that is as generic as possible; the goal of the visualization component of the project was to let the consumer explore the dataset themselves.
Tableau was chosen as the data visualization platform as it:
For Advanced Analytics, we used the AutoML feature of Databricks. The only requirement for using AutoML is having a table that includes a target variable and some data points which might help to predict that target.
In our case, the target was the popularity of the song. AutoML can run experiments to derive a fitting machine learning model. Each experiment also auto-generates the code that ran it, enabling the user to explore further. The machine learning model artifacts of each experiment can be selected for production, and Databricks can also automatically generate an inference pipeline to apply the model to similar data and get predictions.
For those of you searching our architecture diagram for an orchestration tool like Airflow (here is a link to our provider if you’re interested), you won’t find one; Fivetran handled all the scheduling needs of this project, from source to destination to analytics-ready data.
For a consistent and reproducible setup, using infrastructure as code (in this case, Terraform) is always a good idea. Not only standard Fivetran connectors but also AWS Lambda functions are fully maintainable with Terraform, which lets one automate the build and deployment of the Lambda function while keeping a local development setup.
We maintained both AWS Lambda functions through Terraform. The functions were written in Python, and their code required specific packages, which made it necessary to build a code artifact for each function. The ability to test the custom code locally while developing it reduced our development time.
So, what did we unwrap from this data project?
Most common objects in Christmas Number 1 Artwork Covers
The number one trend that emerged from our analysis of Christmas Number 1 record artwork is to have a person in your artwork. If you group person, head, face, adult, male, man, female, woman, and people, those labels represent just over 25% of all objects detected across the entire dataset.
Other than that, the data is rather inconclusive, with the exception that, according to the data, males appear more frequently than females in record artwork.
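As a rough illustration of that grouping, the person-adjacent share can be recomputed from the Rekognition label counts like this; the grouping matches our analysis, but the counts in any usage example are invented (ours came from the full label table):

```python
# Labels we treat as "person-adjacent", per the grouping described above.
PERSON_LABELS = {"Person", "Head", "Face", "Adult", "Male", "Man",
                 "Female", "Woman", "People"}

def person_share(label_counts):
    """Fraction of all detected labels that fall in the person-adjacent group."""
    total = sum(label_counts.values())
    person = sum(count for label, count in label_counts.items()
                 if label in PERSON_LABELS)
    return person / total if total else 0.0
```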
If you’re reading this with the goal of creating your own Christmas Number 1 for next year - we decided it would be interesting to “reverse engineer” what the AI-generated “ideal” Christmas artwork cover would be:
AI-generated Christmas Single artwork based on the most popular objects identified in historical record artwork
When analyzing the characteristics of the most popular Christmas Song (interactive dashboard for self-discovery), according to the data we pulled from Spotify, “Merry Christmas Everyone” brought more (positive) energy than the average of all the Christmas Number 1 songs whilst keeping the level of spoken words low.
If you’re looking for even more data and are planning on using this data to release your own Christmas Number 1 next year, you may also be interested in the correlation between different song characteristics in the advanced analytics section. Top tip - if you want to bring up the energy of your song, crank up the volume!
With AutoML, we ran different experiments on the data to predict popularity. We included all available data, including the one-hot encoded image labels, even though that led to way too many predictors for a data set with just 93 rows. However, it’s a Christmas project, and we were curious.
An experiment then creates multiple models and automatically explores the data. The following chart gets automatically generated for an experiment; it shows all correlations between the metric columns.
Correlation of song characteristics of Christmas Number 1s
All trained models per experiment are displayed in the Databricks UI. One can select the model with the best metrics and investigate it by exploring the whole run step by step in an automatically created notebook, or put the model artifact directly into production to label new data with predicted popularity. We will need to wait until next year, though, for a new song to apply the prediction to. 😉
ML Experiment Outputs in Databricks UI
The challenge of analyzing trends in text searches
This analysis really highlights the challenge of text analysis. “Do They Know It’s Christmas” was in fact, as the data suggests, a Christmas Number 1 in 2004. However, “Perfect” by Ed Sheeran was a Christmas Number 1 in 2017, even though its highest number of searches came in 2015. Sometimes the data cannot account for fan fervor; our hypothesis is that people were searching “Ed Sheeran is perfect” before he released the single, thus skewing the results.
On an aggregate level, which is sometimes the best way to analyze data if one is simply searching for trends, the data does tend towards increased search volume in the years where a song was a Christmas Number 1, as shown by this excellently color-coordinated dashboard.
Data Analysis on Google Trends Data on Christmas Number 1s
Upon further analysis of the Google Trend data, it was identified that some years produced a considerably higher volume of results than others. However, as so often happens in real-world data projects, this was less of a pattern of interest and more of a data processing challenge.
Having analyzed the data, we identified that Google's rate limiting had resulted in some ‘dark spots’ or gaps in the data where responses were refused, and retries to pull the data also failed.
This could be fixed by capturing the requests that failed even after retries, then re-running them at a later time to fill the gaps. This process would likely need to be repeated multiple times to work through all of the gaps.
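Such a gap-filling pass, which we did not implement, could look something like the following sketch; the function names, backoff policy, and retry counts are hypothetical:

```python
import time

def fetch_with_retries(fetch, request, retries=3, base_delay=60):
    """Run fetch(request); on failure, back off and retry.

    Returns (result, None) on success, or (None, request) so the caller can
    collect the failed request and re-run it in a later pass.
    """
    for attempt in range(retries):
        try:
            return fetch(request), None
        except Exception:
            # Wait longer after each failure before trying again
            time.sleep(base_delay * (attempt + 1))
    return None, request  # still a 'dark spot'; queue it for the next pass
```

Failed requests returned this way would be accumulated and replayed hours later, repeating until no gaps remain.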
Hopefully, for those of you who consider yourselves potential Christmas Number 1 artists, the insights gleaned from this analysis will provide a data-driven template for achieving that goal next year.
For those of you who are more interested in the data, this project should highlight that:
If anyone has any suggestions about how we iterate on this for next year, please let us know in the comments!