
Hackathon

Stéphan · Dec 16, 2020 · 22 mins read

I did this project in collaboration with Charly Alizadeh.


The competition

The dataset contains audio features of Spotify tracks for a particular user. The goal of this project is to predict as accurately as possible whether the user likes a track. Every track is labeled in the target column:

“1” means the user likes the track; “0” means they don't.

There are 16 columns in the dataset. Three of them are the track's name, the artist, and the target.

The other 13 columns are the audio features of a track:

  • acousticness.
  • danceability.
  • duration_ms.
  • energy.
  • instrumentalness.
  • key.
  • liveness.
  • loudness.
  • mode.
  • speechiness.
  • tempo.
  • time_signature.
  • valence.

Evaluation criteria:

Two files are given to the students:

  • data.csv, which contains the response variable “target”.
  • test.csv, which does not contain the response and on which you must apply your models.

  • Submission Format: csv
  • Evaluation metric: accuracy (proportion of the correct predictions)
  • Allotted time: 3 hours

Each group can submit a csv file (up to 20 times) to compute its score.

Libraries

library(MASS) # lda() and qda()
library(caTools) # Split data
library(ggplot2) # Better plots
library(tree) # Trees
library(rpart) # Regression & classification trees
library(rpart.plot) # Better tree plots
library(corrplot) # Graphical display of correlation matrices
library(randomForest) # Random forest model
library(gbm) # Boosted model

Importing the data

First, let us import the dataset. We removed the last two columns, which we do not need to train our model, and split data.csv into two subsets, a training set and a test set, to validate our models.

data = read.csv("data.csv")
data = data[,-(15:16)] # Remove song title and Artist
set.seed(1)

split_train = sample.split(data$target,SplitRatio = 0.75)
training_set = subset(data,split_train == TRUE)
test_set = subset(data,split_train == FALSE)
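
Note that sample.split stratifies on the label, so both subsets keep roughly the same proportion of liked tracks. A quick sanity check of our own (not part of the original pipeline):

# Class proportions should be nearly identical in both subsets
prop.table(table(training_set$target))
prop.table(table(test_set$target))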

Data Exploration

Each feature of the dataset is well described here. Let us look at the data ourselves.

#View(data)

boxplot(speechiness ~ target, data=data, col = "blue4",
        main="Boxplot speechiness ~ target", na.action = na.omit)
boxplot(danceability ~ target, data=data,col = "blue4",
        main="Boxplot danceability ~ target", na.action = na.omit)

ggplot(data, aes(x=valence)) +
  geom_histogram(binwidth=0.1, alpha=.5, fill='blue3') +
  ggtitle("Distribution of values for valence")
ggplot(data, aes(x=loudness)) +
  geom_histogram(binwidth=0.5, alpha=.5, fill='blue3') +
  ggtitle("Distribution of values for loudness")
ggplot(data, aes(x=tempo)) +
  geom_histogram(binwidth=2, alpha=.5, fill='blue3') +
  ggtitle("Distribution of values for tempo")
ggplot(data, aes(x=instrumentalness)) +
  geom_histogram(binwidth=0.1, alpha=.5, fill='blue3') +
  ggtitle("Distribution of values for instrumentalness")

Then we plotted the correlation matrix.

corrplot(cor(data))

We observe that target has low linear correlation with the other variables, which is expected since there are only two classes to predict. We did try logistic regression to predict the target, but it did not give good results.
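
For reference, a minimal sketch of the logistic regression baseline we mention, fitting a standard glm with a binomial family and thresholding the predicted probability at 0.5:

# Logistic regression baseline (did not perform well for us)
logit_model = glm(target ~ ., data=training_set, family=binomial)
logit_prob = predict(logit_model, newdata=test_set, type="response")
logit_pred = ifelse(logit_prob > 0.5, 1, 0)
table(test_set$target, logit_pred)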

Measuring metrics

To measure the different metrics of our models we used the following functions:

# Convention: confusion_matrix = table(actual, predicted),
# so TN = [1,1], FP = [1,2], FN = [2,1], TP = [2,2],
# with class "1" (liked track) as the positive class.
accuracy = function(confusion_matrix){
    return ((confusion_matrix[1,1] + confusion_matrix[2,2])/sum(confusion_matrix))
}
specificity = function(confusion_matrix){ # TN/(TN + FP)
    return (confusion_matrix[1,1]/(confusion_matrix[1,1] + confusion_matrix[1,2]))
}
sensitivity = function(confusion_matrix){ # TP/(TP + FN)
    return (confusion_matrix[2,2]/(confusion_matrix[2,2] + confusion_matrix[2,1]))
}
precision = function(confusion_matrix){ # TP/(TP + FP)
    return (confusion_matrix[2,2]/(confusion_matrix[2,2] + confusion_matrix[1,2]))
}
get_confmat_metrics = function(confusion_matrix){
    return (data.frame(accuracy = accuracy(confusion_matrix),
            specificity = specificity(confusion_matrix),
            sensitivity = sensitivity(confusion_matrix),
            precision = precision(confusion_matrix)))
}
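
As a quick sanity check, these helpers can be exercised on a small hand-built confusion matrix; the label vectors below are made up purely for illustration:

# Toy example: 6 made-up actual labels against 6 made-up predictions
actual    = factor(c(0,0,0,1,1,1), levels=c(0,1))
predicted = factor(c(0,0,1,0,1,1), levels=c(0,1))
get_confmat_metrics(table(actual, predicted))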

Experiment

We tested the following models:

  • LDA
  • QDA
  • Bagging
  • RandomForest
  • Boosting

We found better results without scaling the data. We tested various scaling methods, especially with the preProcess() function from the caret package. We also tested multiple sets of features, such as target ~ . - acousticness - loudness and target ~ . - acousticness - loudness - valence, but in the end the best result was obtained with the full set of features (target ~ .).
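
For completeness, here is a minimal sketch of the centre-and-scale preprocessing we mean, using caret's preProcess(); the feature_cols helper and the choice to leave target untouched are our own illustration:

library(caret)

# Fit the scaling on the training features only, then apply it to both
# sets, leaving the 0/1 target column untouched
feature_cols = setdiff(names(training_set), "target")
pp = preProcess(training_set[, feature_cols], method = c("center", "scale"))
training_scaled = training_set
training_scaled[, feature_cols] = predict(pp, training_set[, feature_cols])
test_scaled = test_set
test_scaled[, feature_cols] = predict(pp, test_set[, feature_cols])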

General model testing function

For the boosting model, we tuned the hyperparameters to obtain the best predictions: 7500 trees, an interaction depth of 7, and a shrinkage of 0.01.

build_model = function(training_set, test_set, dependant_variables) {
    # Construct formula, e.g. "target ~ ."
    dependant_str = paste(dependant_variables, collapse="+")
    formula_str = paste("target ~", dependant_str, sep=" ")

    # Build the model (the other models we tried are left commented out)
    #model = lda(as.formula(formula_str), data=training_set)
    #model = glm(as.formula(formula_str), data=training_set, family=binomial)
    #model = qda(as.formula(formula_str), data=training_set)
    #model = randomForest(as.formula(formula_str), data=training_set,
    #                     mtry=floor(sqrt(ncol(training_set) - 1)))
    model = gbm(as.formula(formula_str),
                data = training_set,
                n.trees=7500,
                interaction.depth=7,
                shrinkage=0.01,
                distribution = 'bernoulli'
    )

    # Make predictions; gbm's predict() needs n.trees to be specified
    predict_set = data.frame(test_set)
    pred_object = predict(model, newdata=test_set, n.trees=7500, type="response")
    predict_set$target = ifelse(pred_object > 0.5, 1, 0)
    #predict_set$target = pred_object$class # for lda/qda, which return $class

    # Compute prediction metrics: rows = actual, columns = predicted
    confusion_matrix = table(test_set$target, predict_set$target)
    print(paste("Formula:", formula_str, sep=" "))
    print("Confusion matrix")
    print(confusion_matrix)
    #print(paste("AIC:", summary(model)$aic, sep=" ")) # for glm
    mat_metrics = get_confmat_metrics(confusion_matrix)
    print(mat_metrics)
    return (model)
}

Scaling data

scale_df = function(dataframe) {
    # Centre and scale every column, then return the modified copy
    # (the original version forgot to return, so it had no effect)
    for(name in colnames(dataframe)) {
        dataframe[name] = scale(dataframe[name])
    }
    return(dataframe)
}
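
A quick usage sketch of our own; the target column is excluded from scaling since it is the 0/1 label:

# Scale only the feature columns, keeping the 0/1 target intact
feature_cols = setdiff(names(training_set), "target")
training_scaled = training_set
training_scaled[feature_cols] = scale_df(training_set[feature_cols])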

The following code generated our results:

my_model = build_model(training_set, test_set, c("."))
[1] "Formula: target ~ ."
[1] "Confusion matrix"

      0   1
  0 175  51
  1  46 183
   accuracy specificity sensitivity precision
1 0.7868132   0.7743363   0.7991266 0.7820513

Then we applied our model to the test.csv file and exported our predictions as a csv file. Accuracy obtained after submission on Kaggle: 0.81.

test = read.csv("test.csv")
test = test[-(15:16)] # Remove song title and artist
pred = predict(my_model, newdata=test, n.trees=7500, type="response")
pred = ifelse(pred > 0.5, 1, 0)
to_be_submitted = data.frame(id=rownames(test), target=pred)
write.csv(to_be_submitted , file = "to_be_submitted.csv", row.names = F)
print(to_be_submitted)
     id target
1     1      1
2     2      1
3     3      0
4     4      0
5     5      1
6     6      1
7     7      1
8     8      1
9     9      0
10   10      0
Written by Stéphan, computer science student in Paris.