Let's play with clustering techniques! Here is a basic way to perform k-means and hierarchical clustering.
Libraries
Setup
Exploratory analysis
pointsCards
Question 3
We will first consider a smaller dataset to easily understand the results of k-means. Create a new dataset in which you keep only Points and Yellow.cards from the original dataset. Name it pointsCards.
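No code is shown for this step; a minimal sketch, assuming the original data frame is called ligue1 and the columns are named Points and Yellow.cards:

```r
# Keep only the Points and Yellow.cards columns of the original data
pointsCards <- ligue1[, c("Points", "Yellow.cards")]
```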
Question 4
Apply k-means on pointsCards. Choose k=2 clusters and set the number of iterations to 20. Store your results into km.
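A minimal call matching the question (the seed value is arbitrary):

```r
set.seed(123)  # arbitrary seed, for reproducibility
km <- kmeans(pointsCards, centers = 2, iter.max = 20)
```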
Question 5
Print and describe what is inside km.
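The two outputs below are presumably produced by summary() and str():

```r
summary(km)  # the Length/Class/Mode table below
str(km)      # the structure listing below
```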
Length Class Mode
cluster 20 -none- numeric
centers 4 -none- numeric
totss 1 -none- numeric
withinss 2 -none- numeric
tot.withinss 1 -none- numeric
betweenss 1 -none- numeric
size 2 -none- numeric
iter 1 -none- numeric
ifault 1 -none- numeric
List of 9
$ cluster : Named int [1:20] 1 1 1 1 2 2 2 2 2 2 ...
..- attr(*, "names")= chr [1:20] "Paris-SG" "Monaco" "Lyon" "Marseille" ...
$ centers : num [1:2, 1:2] 82 44.8 71.2 71.6
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "1" "2"
.. ..$ : chr [1:2] "Points" "yellow.cards"
$ totss : num 6878
$ withinss : num [1:2] 257 2181
$ tot.withinss: num 2438
$ betweenss : num 4441
$ size : int [1:2] 4 16
$ iter : int 1
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
Inside km we find the cluster assignment of each observation (cluster), the coordinates of the cluster centers (centers), the total, within-cluster and between-cluster sums of squares (totss, withinss, tot.withinss, betweenss), the cluster sizes (size), the number of iterations performed (iter) and an error indicator (ifault).
Question 6
What are the coordinates of the centers of the clusters (also called prototypes or centroids)?
Points yellow.cards
1 82.00 71.2500
2 44.75 71.5625
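These coordinates can be read directly from the fitted object; row i gives the centroid of cluster i in the original (Points, Yellow.cards) space:

```r
km$centers
```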
Question 7
Plot the data (Yellow.cards vs Points). Color the points corresponding to their cluster.
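A base-R sketch (plotting a two-column data frame draws the second column against the first, i.e. Yellow.cards vs Points):

```r
# Color each team by its cluster assignment
plot(pointsCards, col = km$cluster, pch = 19)
```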
Question 8
Add to the previous plot the clusters centroids and add the names of the observations.
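Continuing the previous plot, one possible way to overlay the centroids and the team names:

```r
points(km$centers, col = 1:2, pch = 3, lwd = 3, cex = 2)             # centroids as crosses
text(pointsCards, labels = rownames(pointsCards), pos = 3, cex = 0.7) # team names
```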
Question 9
Re-run k-means on pointsCards using 3 and 4 clusters and store the results into km3 and km4 respectively. Visualize the results like in question 7 and 8.
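A sketch along the same lines (seed arbitrary):

```r
set.seed(123)
km3 <- kmeans(pointsCards, centers = 3, iter.max = 20)
km4 <- kmeans(pointsCards, centers = 4, iter.max = 20)

plot(pointsCards, col = km3$cluster, pch = 19, main = "k = 3")
points(km3$centers, col = 1:3, pch = 3, lwd = 3, cex = 2)

plot(pointsCards, col = km4$cluster, pch = 19, main = "k = 4")
points(km4$centers, col = 1:4, pch = 3, lwd = 3, cex = 2)
```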
Question 10
Visualize the “within groups sum of squares” of the k-means clustering results (use the code in the link above).
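The linked code is not reproduced in this write-up; the classic version of it looks like this:

```r
# Total within-cluster sum of squares for k = 1..15
wss <- numeric(15)
wss[1] <- (nrow(pointsCards) - 1) * sum(apply(pointsCards, 2, var))  # k = 1: total variance
for (i in 2:15) {
  wss[i] <- sum(kmeans(pointsCards, centers = i)$withinss)
}
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
```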
Question 11
Modify the code of the previous question in order to visualize the ‘between_SS / total_SS’. Interpret the results.
ratio <- numeric(15)
for (i in 1:15) {
  km.i <- kmeans(pointsCards, centers = i)  # run kmeans once per k and reuse the fit
  ratio[i] <- km.i$betweenss / km.i$totss
}
plot(1:15, ratio, type = "b", xlab = "Number of Clusters",
     ylab = "between_SS / total_SS", main = "Elbow method")
We observe the elbow at k=4: the ratio keeps increasing with k, but its rate of increase drops drastically after k=4. Therefore the best value for k is 4.
Ligue 1
Question 12
Scale the dataset and transform it back to a data frame. Store the scaled dataset into ligue1_scaled.
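A minimal sketch, assuming every column of ligue1 is numeric (scale() returns a matrix, hence the conversion back):

```r
ligue1_scaled <- as.data.frame(scale(ligue1))
ligue1_scaled[1:5, 1:5]  # first rows and columns, shown below
```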
| | Points | Wins | Draws | Loses | Goals.scored |
|---|---|---|---|---|---|
| Paris-SG | 2.3182291 | 2.4502435 | -1.3224461 | -1.8871953 | 2.6102823 |
| Monaco | 1.5795777 | 1.6224585 | -0.5877538 | -1.3816966 | 1.5448609 |
| Lyon | 1.4659390 | 1.4569015 | -0.2204077 | -1.3816966 | 1.6375063 |
| Marseille | 1.4091196 | 1.2913445 | 0.5142846 | -1.5501962 | 1.3132476 |
| Rennes | 0.3295522 | 0.2980026 | 0.1469385 | -0.3706991 | -0.0764324 |
Question 13
Apply kmeans() on ligue1 and on ligue1_scaled using 3 clusters and 20 iterations. Store the results into km.ligue1 and km.ligue1.scaled respectively (do not forget to set a seed).
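A sketch (the seed value is arbitrary; it is reset before each call so both runs start from comparable initializations):

```r
set.seed(123)
km.ligue1 <- kmeans(ligue1, centers = 3, iter.max = 20)
set.seed(123)
km.ligue1.scaled <- kmeans(ligue1_scaled, centers = 3, iter.max = 20)
```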
Question 14
How many observations are there in each cluster of km.ligue1 and km.ligue1.scaled? (You can use table().) Do you obtain the same results when you perform kmeans() on the scaled and unscaled data?
## Paris-SG Monaco Lyon Marseille Rennes
## 2 2 2 2 1
## Bordeaux Saint-Etienne Nice Nantes Montpellier
## 1 1 1 1 1
## Dijon Guingamp Amiens Angers Strasbourg
## 1 1 3 3 3
## Caen Lille Toulouse Troyes Metz
## 3 3 3 3 3
##
## 1 2 3
## 8 4 8
## Paris-SG Monaco Lyon Marseille Rennes
## 2 2 2 2 1
## Bordeaux Saint-Etienne Nice Nantes Montpellier
## 1 1 1 1 1
## Dijon Guingamp Amiens Angers Strasbourg
## 1 1 3 3 3
## Caen Lille Toulouse Troyes Metz
## 3 3 3 3 3
##
## 1 2 3
## 8 4 8
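The first block above is presumably the unscaled result and the second the scaled one; the counts come from table(km.ligue1$cluster) and table(km.ligue1.scaled$cluster). Here the two partitions coincide (cluster sizes 8, 4 and 8), although in general k-means is scale-sensitive and scaling can change the solution. A cross-tabulation makes the comparison explicit:

```r
# A diagonal table (up to a relabeling of the clusters) means identical partitions
table(km.ligue1$cluster, km.ligue1.scaled$cluster)
```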
PCA
Question 15
Apply PCA on the ligue1 dataset and store your results in pcaligue1. Do we need to apply PCA on the scaled dataset? Justify your answer.
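The Comp.1/Comp.2 names used below suggest princomp(); with cor = TRUE the PCA is computed on the correlation matrix, i.e. on standardized variables, so there is no need to run it on the scaled dataset separately:

```r
pcaligue1 <- princomp(ligue1, cor = TRUE)  # cor = TRUE standardizes internally
summary(pcaligue1)
```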
Question 16
Plot the observations and the variables on the first two principal components (biplot). Interpret the results.
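For a princomp fit, the biplot is one call:

```r
biplot(pcaligue1)
```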
There seem to be four tendencies among the variables:

- a “losing” group, with a low Comp.1 value;
- a “winning” group, on the opposite side, with a high Comp.1 value;
- Draws, with a medium Comp.1 value but a low Comp.2 value, which seems to be correlated with yellow cards;
- Red cards, related to the total of fouls against, with a high Comp.2 value.

Winning and losing are negatively correlated; Draws and Red cards are negatively correlated too. For the teams, the lower Comp.1, the less likely they are to win; the lower Comp.2, the more likely they are to have taken yellow cards and to have a large number of draws.
Question 17
Visualize the teams on the first two principal components and color them with respect to their cluster.
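A base-R sketch, coloring each team by the cluster found in question 13 (using the scaled fit here is an assumption; the question does not say which one):

```r
plot(pcaligue1$scores[, 1:2], col = km.ligue1.scaled$cluster, pch = 19,
     xlab = "Comp.1", ylab = "Comp.2")
text(pcaligue1$scores[, 1:2], labels = rownames(ligue1), pos = 3, cex = 0.7)
```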
Question 18
Recall that the figure of question 17 is a visualization, on PC1 and PC2, of the clustering done with all the variables, not a clustering on PC1 and PC2. Now apply the kmeans() clustering taking only the first two PCs instead of the variables of the original dataset. Visualize the results and compare with question 17.
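A sketch along the same lines as before:

```r
set.seed(123)
km.pca <- kmeans(pcaligue1$scores[, 1:2], centers = 3, iter.max = 20)
plot(pcaligue1$scores[, 1:2], col = km.pca$cluster, pch = 19,
     xlab = "Comp.1", ylab = "Comp.2")
points(km.pca$centers, col = 1:3, pch = 3, lwd = 3, cex = 2)
```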
Implementing k-means
Question 19
Plot the observations.
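The toy dataset is not shown in this section; judging from the code below, it is a data frame called dataset with two numeric columns X1 and X2:

```r
plot(dataset$X1, dataset$X2, pch = 19, cex = 2)
```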
Question 20
Randomly assign a cluster label to each observation. You can use the sample() command to do this. Report the cluster labels for each observation.
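A sketch using the column name that the code below expects:

```r
set.seed(123)  # arbitrary seed
dataset$cluster_assignement <- sample(c("Cluster 1", "Cluster 2"),
                                      nrow(dataset), replace = TRUE)
dataset$cluster_assignement
```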
Question 21
Compute the centroid for each cluster.
# Compute the centroid of each of the two clusters as the mean of the
# points assigned to it
CENTROIDE = function(dataset)
{
  # accumulators: point count and coordinate sums, per cluster
  data_in_1 = 0
  sum_X1_1 = 0
  sum_X2_1 = 0
  data_in_2 = 0
  sum_X1_2 = 0
  sum_X2_2 = 0
  for (i in 1:nrow(dataset))
  {
    if (dataset$cluster_assignement[i] == "Cluster 1")
    {
      data_in_1 = data_in_1 + 1
      sum_X1_1 = sum_X1_1 + dataset$X1[i]
      sum_X2_1 = sum_X2_1 + dataset$X2[i]
    }
    else
    {
      data_in_2 = data_in_2 + 1
      sum_X1_2 = sum_X1_2 + dataset$X1[i]
      sum_X2_2 = sum_X2_2 + dataset$X2[i]
    }
  }
  # each centroid is the coordinate-wise mean of its cluster
  centroide_1 = c(sum_X1_1 / data_in_1, sum_X2_1 / data_in_1)
  centroide_2 = c(sum_X1_2 / data_in_2, sum_X2_2 / data_in_2)
  return(c(centroide_1, centroide_2))
}
Question 22
Create a function that calculates the Euclidean distance for two observations.
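The function itself is missing from this write-up; a minimal version consistent with the EUCLIDEAN_DISTANCE(a, b) calls in the loop below:

```r
# Euclidean distance between two observations given as numeric vectors
EUCLIDEAN_DISTANCE = function(a, b)
{
  sqrt(sum((a - b)^2))
}
```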
Questions 23, 24 and 25
Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation. Repeat steps 21 and 23 until the assignments stop changing. In your plot from question 19, color the observations according to the cluster labels obtained.
same = FALSE
while (same == FALSE)
{
  new_cluster_assignement = dataset$cluster_assignement
  # compute the new centroids
  result = CENTROIDE(dataset)
  cento1 = c(result[1], result[2])
  cento2 = c(result[3], result[4])
  # reassign every point in the dataset to its closest centroid
  for (i in 1:nrow(dataset))
  {
    if (EUCLIDEAN_DISTANCE(cento1, c(dataset$X1[i], dataset$X2[i])) <
        EUCLIDEAN_DISTANCE(cento2, c(dataset$X1[i], dataset$X2[i])))
    {
      new_cluster_assignement[i] = "Cluster 1"
    }
    else
    {
      new_cluster_assignement[i] = "Cluster 2"
    }
  }
  # stop when no assignment changed (all() is needed: if() cannot
  # take the element-wise comparison of two vectors directly)
  if (all(new_cluster_assignement == dataset$cluster_assignement))
  {
    same = TRUE
  }
  dataset$cluster_assignement = new_cluster_assignement
  # plot the current clustering and the two centroids
  # (color_palette is assumed to be defined in the setup section)
  plot(dataset$X1, dataset$X2, pch = 19, cex = 2,
       col = ifelse(dataset$cluster_assignement == "Cluster 1",
                    color_palette[1], color_palette[3]))
  points(result[1], result[2], col = color_palette[1], pch = 3, lwd = 3, cex = 2)
  points(result[3], result[4], col = color_palette[3], pch = 3, lwd = 3, cex = 2)
  Sys.sleep(2.5)
}
Hierarchical clustering on Iris dataset
Question 1
Download the iris dataset from here and import it into R.
Question 2
Choose randomly 40 observations of the iris dataset and store the sample dataset into sampleiris.
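A sketch, assuming the imported data frame is called dataset_IRIS as in the Question 10 code below:

```r
set.seed(123)  # arbitrary seed
sampleiris <- dataset_IRIS[sample(1:nrow(dataset_IRIS), 40), ]
```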
Question 3
Calculate the Euclidean distances between the flowers. Store the results in a matrix called D. (Remark: the last column of the dataset contains the class labels of the flowers.)
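A sketch that drops the class column before computing the distances:

```r
# Distances on the feature columns only (the last column holds the class)
D <- dist(sampleiris[, -ncol(sampleiris)], method = "euclidean")
```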
Question 4
Construct a dendrogram on the iris sample using the average linkage method. Store the result in dendro.avg.
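A sketch building on D from question 3:

```r
dendro.avg <- hclust(D, method = "average")
plot(dendro.avg)
```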
Question 6
Plot again the dendrogram, hanging the leaves at the same level and labeling them with the flower classes. The exact command is not shown here; by analogy with the Question 10 code it would be:
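```r
plot(dendro.avg, hang = -1, labels = sampleiris$class)
```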
Question 7
To cut the dendrogram and obtain a clustering, use the cutree() function. You can choose the number of clusters you wish to obtain, or you can cut by choosing the height from the dendrogram figure. Cut the dendrogram in order to obtain 3 clusters. Store the results into the vector groups.avg.
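A one-liner suffices here:

```r
groups.avg <- cutree(dendro.avg, k = 3)
groups.avg
```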
Question 8
Visualize the cut tree using the function rect.hclust(). You can choose the colors of the rectangles too!
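A sketch drawing one colored rectangle per cluster on top of the dendrogram:

```r
plot(dendro.avg, hang = -1, labels = sampleiris$class)
rect.hclust(dendro.avg, k = 3, border = c("red", "green", "blue"))
```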
Question 9
Compare the results obtained with hierarchical clustering to the real class labels of the flowers (function table()). Interpret the results.
## groups.avg
## 1 2 3
## 12 2 26
##
## setosa versicolor virginica
## 11 15 14
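The two one-way tables above count the cluster sizes (12, 2, 26) and the true class frequencies (11 setosa, 15 versicolor, 14 virginica) separately. Judging from these margins, one cluster roughly isolates the setosa flowers while versicolor and virginica end up largely mixed in the big cluster of 26. A two-way table would make the agreement explicit:

```r
table(groups.avg, sampleiris$class)
```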
Question 10
Now apply the Hierarchical clustering on the iris dataset (the 150 observations). Choose 3 clusters and compare the results with the real class labels. Compare different methods of Hierarchical clustering (average, complete and single linkages).
D_all = dist(dataset_IRIS[1:150,1:3])
dataset_IRIS_dendro.avg = hclust(D_all, method = "average")
plot(dataset_IRIS_dendro.avg)
plot(dataset_IRIS_dendro.avg, hang=-1, label=dataset_IRIS$class)
dataset_IRIS_groups.avg = cutree(dataset_IRIS_dendro.avg, k = 3)
rect.hclust(dataset_IRIS_dendro.avg, k = 3, border = 25)
## dataset_IRIS_groups.avg
## 1 2 3
## 50 88 12
##
## setosa versicolor virginica
## 50 50 50
dataset_IRIS_dendro.comp = hclust(D_all, method = "complete")
plot(dataset_IRIS_dendro.comp, hang=-1, label=dataset_IRIS$class)
dataset_IRIS_groups.comp = cutree(dataset_IRIS_dendro.comp, k = 3)
rect.hclust(dataset_IRIS_dendro.comp, k = 3, border = 25)
## dataset_IRIS_groups.comp
## 1 2 3
## 50 70 30
##
## setosa versicolor virginica
## 50 50 50
dataset_IRIS_dendro.sgl = hclust(D_all, method = "single")
plot(dataset_IRIS_dendro.sgl, hang=-1, label=dataset_IRIS$class)
dataset_IRIS_groups.sgl = cutree(dataset_IRIS_dendro.sgl, k = 3)
rect.hclust(dataset_IRIS_dendro.sgl, k = 3, border = 25)
## dataset_IRIS_groups.sgl
## 1 2 3
## 50 98 2
##
## setosa versicolor virginica
## 50 50 50
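Reading the cluster sizes above, the cluster of 50 presumably corresponds to setosa, which is well separated, so all three linkages isolate it. They differ on the two other species: average linkage merges most of versicolor and virginica into one cluster of 88, complete linkage comes closest to the true 50/50 split (70 vs 30), and single linkage collapses nearly everything else into one cluster of 98 with a 2-flower leftover, the classic chaining behavior of single linkage.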