01-24-2023 03:06 AM - edited 01-25-2023 08:03 AM
Unwrapping Xmas Number One Singles through Data
by Jack Mason, Jack McCormack, Loic Poulmarc'k, Markus Bergmaier, Sara Seylani, and Niamh O’Brien
The number 1 single on the week of Christmas (known colloquially as just “The Christmas Number 1”) is a festive tradition of growing magnitude in the UK and Ireland, and it remains a coveted and hotly contested prize. Much like how families debate which ingredients make up the perfect Christmas dinner, what makes a Christmas Number 1 is fiercely debated every year. We had several hypotheses as a team about the most influential factors, but instead of debating them out, we decided to analyze the data behind Christmas Number 1s as this year’s team hackathon.
We started this data project as any data scientist ought to, by first deciding on a number of questions that we wanted to answer:
Then it was time to design an architecture that would allow us to get to the bottom of these questions quickly, while also being reliable enough that we could build on it next year if we so wished.
If you’re the sort of person who likes to read the last page of a book first (!?!), here’s a link to the results section.
High Level Architecture Diagram for the project
We began ingesting data from the following data sources:
Spotify
Google Trends
Wikipedia
Record Artwork
Fivetran was the ideal choice as the data ingestion solution for many reasons; crucial for this project were its support for connecting custom data sources via a myriad of methods, its normalized schemas, and its integrations with dbt Core and Terraform.
This facilitated:
Only the song characteristics for Christmas Number 1s were required for this project, therefore this function retrieves the list of Christmas Number 1s from the S3 bucket containing the Wikipedia data and parses the data to extract chart information.
For each chart data entry, the function searches Spotify for track information, collecting the results in a list and the unique track IDs in a set. With those track IDs, it then queries the Spotify API for each track’s audio features.
Finally, it assembles all of the above, along with other meta information, into a dictionary, which is returned as a JSON string.
This explanation of the function was provided by ChatGPT!
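Since the Lambda code itself isn’t shown, here is a minimal sketch of that flow; the function and field names are our own illustrative choices, and the Spotify Web API calls are stubbed out as the `search_fn` parameter so the aggregation logic stands on its own:

```python
import json

# Hypothetical reconstruction of the Spotify Lambda's data flow
# (not the team's actual code).

def collect_tracks(chart_entries, search_fn):
    """Search Spotify for each chart entry; gather track info and unique IDs."""
    tracks, track_ids = [], set()
    for entry in chart_entries:
        track = search_fn(entry["title"], entry["artist"])  # e.g. GET /v1/search
        if track:
            tracks.append(track)
            track_ids.add(track["id"])  # the set de-duplicates repeat Number 1s
    return tracks, track_ids

def build_payload(tracks, audio_features):
    """Combine track info, per-ID audio features, and meta info into JSON."""
    payload = {
        "tracks": tracks,
        "audio_features": audio_features,  # e.g. GET /v1/audio-features?ids=...
        "meta": {"source": "spotify", "record_count": len(tracks)},
    }
    return json.dumps(payload)
```

In the real function, `search_fn` and the audio-features lookup would be authenticated calls against the Spotify Web API.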
To analyze the cover artwork of a song in a data-driven way, we thought it would be most interesting to identify the objects present in the artwork.
For object recognition, we used AWS Rekognition, which is a computer vision platform that can detect and label objects within images. These labels were integrated into Databricks as a table, ready for analysis, using a Fivetran custom function connector.
The approach is relatively simple:
To save cost, it makes sense to include logic that processes each image exactly once. A simple way to do this is to introduce a queuing service such as AWS SQS: images that need processing are written to the queue, and the Lambda function consumes them from there. After processing an image, the Lambda function deletes its message from the queue.
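A sketch of that queue-driven labeling step might look like the following; this is our reconstruction rather than the team’s code, and the clients are passed in as parameters so the flow is visible without AWS credentials (in the real Lambda they would be boto3 clients for Rekognition and SQS):

```python
def label_images(records, rekognition, sqs, bucket, queue_url, min_confidence=80.0):
    """Label each queued artwork image with Rekognition, then delete its message."""
    results = []
    for record in records:  # one SQS message per image key
        key = record["body"]
        response = rekognition.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MaxLabels=10,
            MinConfidence=min_confidence,
        )
        results.append({
            "image": key,
            "labels": [label["Name"] for label in response["Labels"]],
        })
        # Delete the message so each image is processed exactly once
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=record["receiptHandle"])
    return results
```

The `detect_labels` call shape matches the Rekognition API; bucket and queue names would come from the deployment configuration.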
The S3 connector was chosen to ingest Wikipedia data on historical Christmas Number 1s because of its ease of setup and its seamless integration with the Spotify AWS Lambda function. If you are thinking that we just downloaded the relevant Wikipedia page and uploaded it to S3, you would be correct; we chose pragmatism over over-engineering in this instance.
We opted for local execution of a Python script running Pytrends to extract data from Google Trends, as Pytrends was performing unreliably due to Google’s rate limiting and anti-scraping detection methods, which were more easily skirted when executing locally.
The Fivetran Google Sheets connector then allowed us to quickly and easily integrate the output of this local script into Databricks.
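The local pull described above can be sketched roughly as follows; the keyword batching, timeframe, and pacing here are illustrative assumptions rather than the team’s exact settings:

```python
import time

# A minimal sketch of a local Pytrends pull with generous pacing between
# requests to stay under Google's rate limits.

def chunked(items, size):
    """Split a list into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def fetch_trends(keywords, timeframe="2004-01-01 2022-12-31", pause=60):
    """Pull interest-over-time data for each batch of up to 5 search terms."""
    from pytrends.request import TrendReq  # pip install pytrends
    pytrends = TrendReq(hl="en-GB", tz=0)
    frames = []
    for batch in chunked(keywords, 5):  # Trends compares at most 5 terms at once
        pytrends.build_payload(batch, timeframe=timeframe)
        frames.append(pytrends.interest_over_time())
        time.sleep(pause)  # back off between requests to avoid refusals
    return frames
```

The resulting frames would then be written out for the Google Sheets connector to pick up.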
The data from all the different sources used in this project were ingested through Fivetran and integrated into Databricks as relational tables.
Databricks was chosen as a destination because we wanted to do conventional analytics (BI) as well as advanced analytics (AI); the Lakehouse concept of Databricks enabled us for both use cases.
The key requirements in choosing a transformation tool/setup were simplicity and time to live. Fivetran’s Transformations for dbt Core was the tool of choice for this part of the project for several reasons:
dbt Core is also an open-source project, so should we wish to extend this project next year, we could leverage some of Fivetran’s pre-built dbt models.
Interactive Data Lineage Graph of the modeling layer within the Fivetran UI
To best understand the trends in Christmas Number 1s and perform holistic analytics, we combined all of the data sources to form a final analytics-ready model.
Following dbt best practices, we had a staging layer for data reformatting, an intermediate layer for table join operations and a Mart layer which was exposed to Tableau and AutoML.
Custom calculations and use-case-specific definitions were handled in the visualization layer, since we wanted the modeling layer to manage only the computationally expensive operations and produce a dataset that is as generic as possible; the goal of the visualization component of the project was to let the consumer explore the dataset themselves.
Tableau was chosen as the data visualization platform as it:
For Advanced Analytics, we used the AutoML feature of Databricks. The only requirement for using AutoML is having a table that includes a target variable and some data points which might help to predict that target.
In our case, the target was the popularity of the song. AutoML can run experiments to derive a fitting machine learning model. Each experiment also auto-generates the code that ran it, enabling the user to explore further. The machine learning model artifacts of each experiment can be selected for production, and Databricks can also automatically generate an inference pipeline to apply the model to similar data and get predictions.
For those of you searching our architecture diagram for an orchestration tool like Airflow (here is a link to our provider if you’re interested), you won’t find one; Fivetran handled all the scheduling needs of this project, from source to destination to analytics-ready data.
For a consistent and reproducible setup, using infrastructure as code (in this case, Terraform) is always a good idea. Not only standard Fivetran connectors but also AWS Lambda functions are fully maintainable with Terraform, which lets one automate the build and deployment of the Lambda function while keeping a local development setup.
We maintained both AWS Lambda functions through Terraform. The functions were written in Python, and their code required specific packages, which made it necessary to build a code artifact for each function. The ability to test the custom code locally while developing it reduced our development time.
So, what did we unwrap from this data project?
Most common objects in Christmas Number 1 Artwork Covers
The number one trend that emerged from our analysis of Christmas Number 1 record artwork is to have a person in your artwork. If you group person, head, face, adult, male, man, female, woman, and people, those labels represent just over 25% of all objects detected across the entire dataset.
Other than that, the data is rather inconclusive, with the exception that, according to the data, males appear more frequently than females in record artwork.
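As a rough illustration of that grouping, the person-adjacent share can be recomputed from the Rekognition label counts like this; the grouping matches our analysis, but the counts in any usage example are invented (ours came from the full label table):

```python
# Labels we treat as "person-adjacent", per the grouping described above.
PERSON_LABELS = {"Person", "Head", "Face", "Adult", "Male", "Man",
                 "Female", "Woman", "People"}

def person_share(label_counts):
    """Fraction of all detected labels that fall in the person-adjacent group."""
    total = sum(label_counts.values())
    person = sum(count for label, count in label_counts.items()
                 if label in PERSON_LABELS)
    return person / total if total else 0.0
```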
If you’re reading this with the goal of creating your own Christmas Number 1 for next year - we decided it would be interesting to “reverse engineer” what the AI-generated “ideal” Christmas artwork cover would be:
AI-generated Christmas Single artwork based on the most popular objects identified in historical record artwork
When analyzing the characteristics of the most popular Christmas Song (interactive dashboard for self-discovery), according to the data we pulled from Spotify, “Merry Christmas Everyone” brought more (positive) energy than the average of all the Christmas Number 1 songs whilst keeping the level of spoken words low.
If you’re looking for even more data and are planning on using this data to release your own Christmas Number 1 next year, you may also be interested in the correlation between different song characteristics in the advanced analytics section. Top tip - if you want to bring up the energy of your song, crank up the volume!
With AutoML, we ran different experiments on the data to predict popularity. We included all available data, including the one-hot encoded image labels, even though that led to way too many predictors for a data set with just 93 rows. However, it’s a Christmas project, and we were curious.
An experiment then creates multiple models and automatically explores the data. The following chart gets automatically generated for an experiment; it shows all correlations between the metric columns.
Correlation of song characteristics of Christmas Number 1s
All trained models per experiment are displayed in the Databricks UI. One can select the model with the best metrics and investigate it by exploring the whole run step by step in an automatically created notebook, or put the model artifact directly into production to label new data with predicted popularity. We will need to wait until next year, though, for a new song to apply the prediction to. 😉
ML Experiment Outputs in Databricks UI
The challenge of analyzing trends in text searches
This analysis really highlights the challenge of text analysis. “Do They Know It’s Christmas” was in fact, as the data suggests, a Christmas Number 1 in 2004. However, “Perfect” by Ed Sheeran was a Christmas Number 1 in 2017, even though its highest number of searches came in 2015. Sometimes the data cannot account for fan fervor; our hypothesis is that people were searching “Ed Sheeran is perfect” before he released the single, thus skewing the results.
On an aggregate level, which is sometimes the best way to analyze data if one is simply searching for trends, the data does tend towards increased search volume in the years where a song was a Christmas Number 1, as shown by this excellently color-coordinated dashboard.
Data Analysis on Google Trends Data on Christmas Number 1s
Upon further analysis of the Google Trend data, it was identified that some years produced a considerably higher volume of results than others. However, as so often happens in real-world data projects, this was less of a pattern of interest and more of a data processing challenge.
Having analyzed the data, we identified that Google's rate limiting had resulted in some ‘dark spots’ or gaps in the data where responses were refused, and retries to pull the data also failed.
This could be fixed by capturing the requests that failed even after retries, then re-running them at a later time to fill the gaps. This process would likely need to be repeated multiple times to work through all of the gaps.
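Such a gap-filling pass, which we did not implement, could look something like the following sketch; the function names, backoff policy, and retry counts are hypothetical:

```python
import time

def fetch_with_retries(fetch, request, retries=3, base_delay=60):
    """Run fetch(request); on failure, back off and retry.

    Returns (result, None) on success, or (None, request) so the caller can
    collect the failed request and re-run it in a later pass.
    """
    for attempt in range(retries):
        try:
            return fetch(request), None
        except Exception:
            # Wait longer after each failure before trying again
            time.sleep(base_delay * (attempt + 1))
    return None, request  # still a 'dark spot'; queue it for the next pass
```

Failed requests returned this way would be accumulated and replayed hours later, repeating until no gaps remain.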
Hopefully, for those of you who consider yourselves potential Christmas Number 1 artists, the insights gleaned from this analysis will provide a data-driven template for achieving that goal next year.
For those of you who are more interested in the data, this project should highlight that:
If anyone has any suggestions about how we iterate on this for next year, please let us know in the comments!