Sparkify: A Churn Analysis Using Raw Events and PySpark

Music is an important part of our lives; we listen to it while working out, on the bus or the train, while cooking or washing the dishes, or while training for a marathon. Some people listen to music for a couple of minutes a day, others practically every minute of the day.


Even with a huge range of music streaming applications available, Spotify, TIDAL, Amazon Music, Pandora, Google Play, and Apple Music to name a few, one characteristic is present in all of them: there is no escaping the constant advertisements that interrupt your session if you are on the free tier.

Isn’t it frustrating?!

On the other hand, most people do not wish to spend money on a music streaming service that cannot provide the songs they love, cool playlists, and a fun, easy interface.

User engagement, retention, customer loyalty, and churn have always been essential subjects for businesses. Machine learning and predictive analytics help businesses gain endless insights from the ever-growing catalog of user behavior. Especially nowadays, in the era of Big Data, the opportunities for using this information are countless. Data science and artificial intelligence can shape a company's marketing strategy and point to which users are potential cancellers, so that offers and discounts can be sent to prevent them from cancelling, or even to increase customer loyalty.


Enter Sparkify, a fictional yet popular music streaming app created by Udacity. Sparkify offers two tiers: a free tier that generates revenue by serving periodic advertisements, and a paid tier that lets users listen to music non-stop and ad-free. Sparkify users may register on either tier, and they can cancel the entire service, or downgrade from the paid to the free tier, at any moment.

Sparkify records every action taken by a user, e.g. songs listened to, songs liked, visits to the home page, songs added to a playlist, and so on. Sparkify has great potential to leverage this huge volume of data to discover users who are likely to downgrade or cancel, and to design marketing strategies that retain those users and save every penny possible.

Problem Statement

This project addresses a classic churn prediction problem. My main goal is to train a machine learning model, a binary classifier built on features extracted from raw data about users and their activity and interactions with the Sparkify service, that predicts which users want to cancel the music streaming service before they actually do. A strong ML model would flag potential cancellers and alert the company, so that marketing offers and discounts could be sent to those users to prevent them from cancelling.

I worked on this project as my capstone for Udacity's Data Science Nanodegree program. Udacity provides two datasets to explore: a 128MB mini subset and a full-size 12GB dataset. Because of the data volume, the project was implemented on the Apache Spark distributed cluster computing framework using PySpark, the Python API for Spark. For this project, I used PySpark running in local mode on the mini dataset.

Loading and Cleaning

To use Spark, our first step is to set up a Spark session, even when running on a local machine rather than an actual cluster.

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession \
    .builder \
    .appName("Sparkify Churn Prediction") \
    .getOrCreate()

The mini dataset file we used is mini_sparkify_event_data.json. I will skip the boring loading commands and bits; however, feel free to check my GitHub repo in case you are interested.
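In PySpark the whole load is a single `spark.read.json("mini_sparkify_event_data.json")` call. Since the file is line-delimited JSON (one event per line), the plain-Python sketch below illustrates the format; the sample records are made up for illustration:

```python
import json

def parse_events(lines):
    """Parse JSON-lines event records into a list of dicts
    (what spark.read.json does for a line-delimited file)."""
    return [json.loads(line) for line in lines if line.strip()]

# Two made-up event lines in the same shape as the Sparkify log
sample = [
    '{"userId": "30", "page": "NextSong", "level": "paid"}',
    '{"userId": "9", "page": "Home", "level": "free"}',
]
events = parse_events(sample)
```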

Let’s take a look at the data schema:

|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)

and the first 3 rows of data:

[Row(artist='Martha Tilston', auth='Logged In', firstName='Colin', gender='M', itemInSession=50, lastName='Freeman', length=277.89016, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Rockpools', status=200, ts=1538352117000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30'),
Row(artist='Five Iron Frenzy', auth='Logged In', firstName='Micah', gender='M', itemInSession=79, lastName='Long', length=236.09424, level='free', location='Boston-Cambridge-Newton, MA-NH', method='PUT', page='NextSong', registration=1538331630000, sessionId=8, song='Canada', status=200, ts=1538352180000, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.103 Safari/537.36"', userId='9'),
Row(artist='Adam Lambert', auth='Logged In', firstName='Colin', gender='M', itemInSession=51, lastName='Freeman', length=282.8273, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Time For Miracles', status=200, ts=1538352394000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30')]

The dataset has 286,500 rows and 18 columns. While checking for missing values in the userId column, we discovered that missing userId values are actually represented by an empty string, as shown below.

| userId|
| |
| 10|
| 100|

When a user is not logged in or has not registered yet, an empty string is recorded as the userId. These rows contribute little to nothing to our analysis, so we removed all rows with an empty string as userId.
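In PySpark this cleaning step is a one-line filter, sketched in the comment below; the plain-Python mirror of the same rule (with made-up event dicts) illustrates what it keeps and drops:

```python
# PySpark one-liner, assuming df is the loaded DataFrame:
#   df_clean = df.filter(df.userId != "")

def drop_empty_users(events):
    """Remove event rows whose userId is an empty string."""
    return [e for e in events if e.get("userId", "") != ""]

events = [
    {"userId": "30", "page": "NextSong"},
    {"userId": "", "page": "Home"},      # guest session, not logged in
    {"userId": "9", "page": "NextSong"},
]
cleaned = drop_empty_users(events)
```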

After cleaning the data, here is a sneak peek at the raw data tracked by the app:

Raw data

The app records demographic data about the user such as gender, geographic data, timestamps, and listening data such as the artist, the songs played, and their length. The page column plays an important role in our analysis, as it reveals tracking information about the user, for instance whether they played the next song or logged out of the app.

Exploratory Data Analysis

In our preliminary analysis of the dataset, we discovered that every action of the user in Sparkify is recorded, so the granularity of the dataset is at the event level.

Two events are very important for our churn prediction: Cancellation Confirmation and Submit Downgrade. These show that the user has cancelled the music service or downgraded from the paid to the free tier, respectively. We used those events to separate users according to their behaviour and label each one as a churn and/or a downgrade, as shown in the table below.

|userId|churn|downgrade|
|100010| 0| 0|
|200002| 0| 0|
| 125| 1| 0|
| 124| 0| 0|
| 51| 1| 0|
| 7| 0| 0|
| 15| 0| 0|
| 54| 1| 1|
| 155| 0| 0|
|100014| 1| 0|
| 132| 0| 0|
| 154| 0| 0|
| 101| 1| 0|
| 11| 0| 1|
| 138| 0| 0|
|300017| 0| 0|
|100021| 1| 0|
| 29| 1| 0|
| 69| 0| 0|
| 112| 0| 0|
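The labelling behind this table can be sketched in plain Python (the PySpark version applies the same page checks inside a groupBy; the sample events below are made up):

```python
from collections import defaultdict

CHURN_PAGE = "Cancellation Confirmation"
DOWNGRADE_PAGE = "Submit Downgrade"

def label_users(events):
    """Flag each userId with churn / downgrade indicators (0 or 1)."""
    labels = defaultdict(lambda: {"churn": 0, "downgrade": 0})
    for e in events:
        uid = e["userId"]
        labels[uid]  # ensure every seen user gets an entry
        if e["page"] == CHURN_PAGE:
            labels[uid]["churn"] = 1
        elif e["page"] == DOWNGRADE_PAGE:
            labels[uid]["downgrade"] = 1
    return dict(labels)

events = [
    {"userId": "125", "page": "NextSong"},
    {"userId": "125", "page": CHURN_PAGE},
    {"userId": "54", "page": DOWNGRADE_PAGE},
    {"userId": "54", "page": CHURN_PAGE},
    {"userId": "7", "page": "NextSong"},
]
labels = label_users(events)
```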

Afterwards, we visualised some useful user statistics, which help us better understand the overall behaviour of the users and investigate whether there is an easy way to tell that a user has a high chance of cancelling our music service.

i) Total number of listened songs vs user churn

Image 1: User behaviour regarding the listened songs

As we can see from the violin plot above (Image 1), the total number of songs listened to does not vary much between female and male cancellers or non-cancellers.

ii) Total number of liked songs vs user churn

Image 2: User behaviour regarding the liked songs

As we can see from the violin plot above (Image 2), the users' total liked songs do not vary between female and male cancellers or non-cancellers.

iii) Churn pattern per gender

Image 3: Churns per gender

The bar plot above (Image 3) shows that male users are slightly more likely to cancel their subscription.

iv) Total number of played songs per session vs churn

Image 4: Songs per session

The violin plot (Image 4) above shows that the cancellers played fewer songs per session than the non-cancellers.

v) User churn point

Image 5: User churn point

As we can see in Image 5, cancellation usually happens when the user is in the paid tier.

vi) Registration time until the cancellation

Image 6: Registration time until the cancellation

Finally, the box plot (Image 6) above illustrates that cancellers used the music streaming services for a shorter period of time.

Feature Engineering

Although the dataset carries a lot of raw information, multiple combinations and calculations are needed to build useful features.

Hence, the first feature I extracted from the raw data is the registration time, i.e. the time elapsed since the user registered. The registration period can reflect user loyalty and engagement. The second feature is the total number of songs listened to: a high number indicates that the user spends more time with our service and has built a deeper engagement, which makes cancellation less likely. The third feature is the total number of thumbs up, reflecting both the quality of our service and user engagement. Consequently, the fourth feature is the total number of thumbs down, for the same reason as the third.

The fifth feature is the total number of songs added to a playlist, created for the same purpose as the third and fourth features. The sixth feature is the total number of friends added: the higher the number, the deeper the user's engagement with our service. The seventh feature is the total listening length, serving the same purpose as the total number of songs listened to, but from a time perspective.

The eighth feature is the average number of songs listened to per session: the higher the number, the more time the user spends in our service.

The ninth feature is gender, since we would like to inject demographic information about the user into the model as well; it helps us capture differences between genders in the chance to churn. The tenth feature is the number of distinct artists listened to: the higher the number, the more interesting and “irreplaceable” our service is to the user.
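A few of these aggregations can be sketched in plain Python; the real pipeline does this with PySpark groupBy/agg, and the page labels and sample events here are assumptions based on the Sparkify log format:

```python
def user_features(events):
    """Aggregate one user's events into a subset of the features above.
    Page names ("NextSong", "Thumbs Up", ...) are assumed from the
    Sparkify log; adjust if your dump uses different labels."""
    feats = {
        "total_num_listened_songs": 0,
        "total_num_thumb_up": 0,
        "total_listening_time": 0.0,
        "num_of_listened_artists": 0,
    }
    artists = set()
    for e in events:
        if e["page"] == "NextSong":
            feats["total_num_listened_songs"] += 1
            feats["total_listening_time"] += e.get("length", 0.0)
            if e.get("artist"):
                artists.add(e["artist"])
        elif e["page"] == "Thumbs Up":
            feats["total_num_thumb_up"] += 1
    feats["num_of_listened_artists"] = len(artists)
    return feats

events = [
    {"page": "NextSong", "artist": "Adam Lambert", "length": 282.8},
    {"page": "NextSong", "artist": "Adam Lambert", "length": 250.0},
    {"page": "Thumbs Up"},
]
feats = user_features(events)
```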

The following table provides an overview of our feature list:

| registration_time|total_num_listened_songs|total_num_thumb_up|total_num_thumb_down|total_songs_added_to_playlist|total_num_added_friend|total_listening_time|avg_num_listened_songs|gender|num_of_listened_artists|label|
| 55.6436574074074| 381| 17| 5| 7| 4| 66940.89735000003| 39.285714285714285| 1| 252| 0|
| 70.07462962962963| 474| 21| 6| 8| 4| 94008.87593999993| 64.5| 0| 339| 0|
| 71.31688657407408| 11| 0| 0| 0| 0| 2089.1131000000005| 8.0| 0| 8| 1|
|131.55591435185184| 4825| 171| 41| 118| 74| 1012312.0927899999| 145.67857142857142| 1| 2232| 0|
|19.455844907407407| 2464| 100| 21| 52| 28| 523275.8428000004| 211.1| 0| 1385| 1|


After generating our feature list, we need to train and evaluate a few models and select the best one. The first step is to choose the evaluation metric for our models: the F1 score. To evaluate the efficacy of our model we need a simple measure of precision, e.g. whether we send a special offer to the right person, and of recall, e.g. whether we miss a user we should have sent an offer to; the F1 score combines both. We also report accuracy.


Image: Accuracy (Wikipedia)

Accuracy is also used as a statistical measure of how well a binary classification test correctly identifies or excludes a condition. — Wikipedia

Accuracy is one of our chosen performance measures. It is the ratio of correct predictions to the total number of predictions. This metric works well on datasets where false positives and false negatives have the same cost, e.g. a balanced dataset. In our case it is better to use another metric to assess effectiveness; however, for investigation purposes we report accuracy as well.

The F1 score is the harmonic mean of the precision and recall. The more generic score applies additional weights, valuing one of precision or recall more than the other. — Wikipedia.

Image: F1 score (Wikipedia)

The F1 score is our chosen performance metric, as it considers both false positives and false negatives. Hence, it is ideal in situations where the class distribution in the dataset is imbalanced. The main goal of this project is to identify the majority of users who might cancel their subscription, while neither handing out a huge amount of marketing offers and discounts for no reason, i.e. to users who have no intention of cancelling (false positives), nor missing the actual cancellers (false negatives).
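As a concrete check, precision, recall, and F1 can be computed by hand from confusion-matrix counts; the counts below are hypothetical:

```python
def f1_from_counts(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical: 20 churners caught, 10 offers wasted, 5 churners missed
p, r, f1 = f1_from_counts(tp=20, fp=10, fn=5)
```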

We need to identify the users who are likely to cancel their subscription and send them special offers to persuade them to keep using our service; however, we do not want to waste money and resources by sending far too many offers, especially to users who are unlikely to churn. Our second step is to vectorize our features. Next, we also need to standardize the features, so that a feature with a larger scale does not dominate the whole model. We achieve this by subtracting the mean of each feature and dividing by its standard deviation.
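The standardization step can be sketched as follows; in the pipeline this is done by Spark's StandardScaler (note this sketch uses the population standard deviation, while StandardScaler defaults to the sample one):

```python
from math import sqrt

def standardize(column):
    """Scale a feature column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((x - mean) ** 2 for x in column) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in column]

# Example: the total_num_listened_songs values from the table above
scaled = standardize([381.0, 474.0, 11.0, 4825.0])
```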

Afterwards, we split the dataset into train, validation and test sets.

train, rest = data.randomSplit([0.6, 0.4], seed=42)
validation, test = rest.randomSplit([0.5, 0.5], seed=42)

As we can see, we applied the Spark DataFrame method randomSplit twice, and we used the random seed parameter to ensure repeatable results each time we run the code.


We evaluated two baseline models, one with all users labelled as churn = 0 and a second with all users labelled as churn = 1. Afterwards, we calculated the evaluation metrics of each, i.e. accuracy and F1 score.

According to the evaluation results on the test set, the baseline model labelling all users with churn = 0 performs quite well, reporting an accuracy of 73.5% and an F1 score of 62.3%.
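These baseline numbers can be verified with a little arithmetic, assuming the reported F1 is the support-weighted average that Spark's MulticlassClassificationEvaluator returns by default:

```python
def all_zero_baseline(neg_fraction):
    """Accuracy and weighted F1 of a classifier that always predicts
    churn = 0, given the fraction of true non-churners in the test set."""
    accuracy = neg_fraction
    # class 0: precision = neg_fraction, recall = 1.0
    f1_class0 = 2 * neg_fraction / (neg_fraction + 1.0)
    # class 1 is never predicted, so its F1 is 0;
    # the weighted average weights each class by its support
    weighted_f1 = neg_fraction * f1_class0
    return accuracy, weighted_f1

acc, f1 = all_zero_baseline(0.735)  # reproduces roughly 0.735 / 0.623
```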


Next we evaluated four different models. To minimise the chance of overfitting, we used cross-validation and grid search to fine-tune them. We evaluated all four models on the validation set and chose the optimal one based on its F1 score there. At this stage, default parameter settings were used, due to the computational cost of training.

This is a summary of our model evaluation:

Since there is practically no difference between the three best models, Logistic Regression, Support Vector Machine, and Random Forest, in terms of evaluation metrics or training time (even if we care about time resources, the differences are minimal), we chose the two models with the best evaluation results and the shortest training times.

Therefore, we chose Logistic Regression and Random Forest models to conduct a grid search to fine tune them and finally select the best one. Moreover, we selected the Random Forest to investigate the importance of each feature.

We used a ParamGridBuilder to construct a grid of parameters to search over.

For the Logistic Regression model, we fine tuned 3 parameters:

i) regParam

ii) fitIntercept

iii) elasticNetParam

with the following values for each parameter:

# build paramGrid
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()
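For intuition, the grid above expands to the cross product of the three value lists, which can be reproduced in plain Python with itertools:

```python
from itertools import product

reg_param = [0.1, 0.01]
fit_intercept = [False, True]
elastic_net = [0.0, 0.5, 1.0]

# Every (regParam, fitIntercept, elasticNetParam) combination:
# 2 * 2 * 3 = 12 candidate models. With k-fold cross-validation
# (Spark's CrossValidator defaults to 3 folds), each candidate is
# trained once per fold.
grid = list(product(reg_param, fit_intercept, elastic_net))
```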

The table below shows the results of the grid search:

Grid search results at the Logistic Regression.

Afterwards, we used the best set of parameters to train the Logistic Regression model and evaluate it on the test set:

lr_best = LogisticRegression(maxIter=10, regParam=0.01, fitIntercept=True, elasticNetParam=0.0)
lr_best_model = lr_best.fit(train)
final_results = lr_best_model.transform(test)
evaluator_lr = MulticlassClassificationEvaluator(predictionCol="prediction")
print('Evaluation metrics on test set:')
print('Accuracy: {}'.format(evaluator_lr.evaluate(final_results, {evaluator_lr.metricName: "accuracy"})))
print('F1_score: {}'.format(evaluator_lr.evaluate(final_results, {evaluator_lr.metricName: "f1"})))

The results of the evaluation were the following:

Evaluation metrics on test set:
Accuracy: 0.7647058823529411

For the Random Forest model, we fine-tuned 2 parameters:

i) numTrees

ii) maxDepth

with the following values for each parameter:

paramGrid = ParamGridBuilder() \
    .addGrid(r_forest.numTrees, [int(x) for x in np.linspace(start=10, stop=50, num=3)]) \
    .addGrid(r_forest.maxDepth, [int(x) for x in np.linspace(start=5, stop=25, num=3)]) \
    .build()

The table below shows the results of the grid search:

Grid search results at the Random Forest

Afterwards, we used the best set of parameters to train the Random Forest model and evaluate it on the test set:

r_forest_best = RandomForestClassifier(numTrees=30, maxDepth=5)
r_forest_best_model = r_forest_best.fit(train)
final_results = r_forest_best_model.transform(test)
evaluator_r_forest = MulticlassClassificationEvaluator(predictionCol="prediction")
print('Evaluation metrics on test set:')
print('Accuracy: {}'.format(evaluator_r_forest.evaluate(final_results, {evaluator_r_forest.metricName: "accuracy"})))
print('F1_score: {}'.format(evaluator_r_forest.evaluate(final_results, {evaluator_r_forest.metricName: "f1"})))

The results of the evaluation were the following:

Evaluation metrics on test set:
Accuracy: 0.7794117647058824

As a result, the best performing model is the Random Forest, with parameters numTrees=30 and maxDepth=5, and the following results:

Evaluation metrics on test set:
Accuracy: 0.7794117647058824

Choosing the Random Forest model gave us the opportunity to use the model's feature importance function.

It is clear in the bar plot above that the registration time in our service plays the most crucial role. The fact that cancellers had shorter registration times indicates the bias of this feature, and we might need to reconsider our model or reduce the bias with some sort of transformation. Furthermore, the total number of friends added, the total number of thumbs down, and the average number of songs listened to also appear to be important features. For instance, we might conclude that our song recommendation engine performs poorly and recommends the wrong songs if the number of thumbs down is too high, or that a user loves our service and the songs provided in it if the number of songs listened to is high.
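A minimal sketch of that step: pair the model's featureImportances vector with our feature names and rank them. The importance values below are made-up placeholders chosen to mirror the ranking described above, not the actual model output:

```python
feature_names = [
    "registration_time", "total_num_listened_songs", "total_num_thumb_up",
    "total_num_thumb_down", "total_songs_added_to_playlist",
    "total_num_added_friend", "total_listening_time",
    "avg_num_listened_songs", "gender", "num_of_listened_artists",
]
# Placeholder values; in Spark these come from
# r_forest_best_model.featureImportances.toArray()
importances = [0.35, 0.08, 0.06, 0.11, 0.05, 0.13, 0.07, 0.10, 0.01, 0.04]

# Rank features from most to least important
ranked = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
```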

Finally, the gender of the user plays almost no important role in our model.


In this project, we built a machine learning model capable of predicting whether a user will churn. We performed multiple preprocessing steps: we removed rows with a missing userId, converted columns containing timestamp data into a more human-readable format, and converted the gender column into a binary numeric column. We performed an extensive exploratory data analysis, visualised cool plots that describe valuable user statistics, and engineered 10 features for our machine learning model. Next, we compared 4 different machine learning models, Logistic Regression, Gradient Boosted Trees, Support Vector Machine, and Random Forest, and chose Logistic Regression and Random Forest as the most promising based on the evaluation metrics and the seconds needed for training. Furthermore, we used cross-validation and grid search to fine-tune both models, and based on the evaluation results on the test set we chose the Random Forest as our final model. We achieved about 78% accuracy and a 76% F1 score, roughly a 16% improvement over our baseline model, e.g. sending everyone an offer.


In this project we also aimed to showcase the Spark environment as a tool for analysing data volumes that a personal laptop would probably be incapable of handling. Predicting potential cancellers before they actually churn gives companies the opportunity to send targeted messages and offers and to minimise the cost of retaining existing customers. Still, engineering appropriate and informative features from the available data is the most interesting challenge of the project: informative features are highly important for building a good predictive model, yet producing them is unfortunately a costly and time-consuming effort. Exploratory and explanatory data analysis plays an important role in feature engineering as well.


Adding extra domain knowledge and expertise could significantly improve the feature engineering of this project. As the user base grows, more data will become available to analyse with tools such as Spark, and the results should improve considerably. Currently, our analysis used the records of 450 unique users of our service, and only 60% of them were used for training the machine learning model. The model could improve greatly if the training sample increases, and the expected model performance would increase as well.