Principal Components Analysis is a well known technique in dimensionality reduction. It aims to project points (elements of the dataset) in a new space by maximizing the variablility in the first axis of the new space. So the variance is considered as relevant information. The reasons to Reduce Dimensionality is to obtain quicker and more accurate results from machine learning methods. Usually to deal with computational issues with large numbers of predictors, the inability to use certain statistical methods when not enough observations per predictor exist (such as regression), noise conducting to less accurate results (model over-fitting). It is easier to visualize data and to data mine with fewer predictors. However, contrived examples suggest PCA cannot adequately reduce dimensionality without losing important information if the predictor subspace important to an outcome is not linear or there are inherent curves in these relationships. Let us investigate this technique with R studio.
setup
You can download the iris dataset here and import it as follow :
color_palette = c("lightpink", "aquamarine3","royalblue")
# Don't forget to change with your current working directory.
setwd("C:/PutYourDirectory")
set.seed(2)
iris = read.csv('iris.data')
To visualize the dataset :
View(data)
Exploratory analysis
First, let us investigate the dataset to understand its structure. What one can do is comparing the means and the quartiles of the 3 different flower classes for the 4 different features.
boxplot(sepal_length ~ iris$class, data = iris, col = color_palette, main = "Boxplot of the sepal length")
boxplot(sepal_width ~ iris$class, data = iris, col = color_palette, main = "Boxplot of the sepal width")
boxplot(petal_length ~ iris$class, data = iris, col = color_palette, main = "Boxplot of the petal length")
boxplot(petal_width ~ iris$class, data = iris, col = color_palette, main = "Boxplot of the sepal width")
To explore how the 3 different flower classes are distributed along the 4 different features, we will visualize them via histograms using the following code.
# Let's use the ggplot2 library
# ggplot2 is the most advanced package for data visualization
# gg corresponds to The Grammar of Graphics.
library(ggplot2) #of course you must install it first if you don't have it already
# histogram of sepal_length
ggplot(iris, aes(x=sepal_length, fill=class)) +
geom_histogram(binwidth=.2, alpha=.5)
# histogram of sepal_width
ggplot(iris, aes(x=sepal_width, fill=class)) +
geom_histogram(binwidth=.2, alpha=.5)
# histogram of petal_length
ggplot(iris, aes(x=petal_length, fill=class)) +
geom_histogram(binwidth=.2, alpha=.5)
# histogram of petal_width
ggplot(iris, aes(x=petal_width, fill=class)) +
geom_histogram(binwidth=.2, alpha=.5)
PCA using princomp()
Now we have visualized our data, we can apply a PCA on the Iris dataset using the princomp()
function.
pcairis=princomp(iris[,-5], cor=T)
Let us look at the result. What one can do to assess the principal component is to look the plot of the component in function of the variance. Looking at the elbow and the cumulative Proportion, PC1 and PC2 cover more than 95% of variability in the dataset. So if we only want to use the first two axis we won’t loose to much information.
str(pcairis)
## List of 7
## $ sdev : Named num [1:4] 1.706 0.96 0.384 0.144
## ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
## $ loadings: 'loadings' num [1:4, 1:4] 0.522 -0.263 0.581 0.566 0.372 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "sepal_length" "sepal_width" "petal_length" "petal_width"
## .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
## $ center : Named num [1:4] 5.84 3.05 3.76 1.2
## ..- attr(*, "names")= chr [1:4] "sepal_length" "sepal_width" "petal_length" "petal_width"
## $ scale : Named num [1:4] 0.825 0.432 1.759 0.761
## ..- attr(*, "names")= chr [1:4] "sepal_length" "sepal_width" "petal_length" "petal_width"
## $ n.obs : int 150
## $ scores : num [1:150, 1:4] -2.26 -2.09 -2.37 -2.3 -2.39 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
## $ call : language princomp(x = iris[, -5], cor = T)
## - attr(*, "class")= chr "princomp"
summary(pcairis)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.7061120 0.9598025 0.38386622 0.143553848
## Proportion of Variance 0.7277045 0.2303052 0.03683832 0.005151927
## Cumulative Proportion 0.7277045 0.9580098 0.99484807 1.000000000
plot(pcairis)
Let us see the flower plot in the first two axis of the PCA.
biplot(pcairis)
Deeper PCA using factoextra package
We can use factoextra
and prcomp()
which is a built-in function in R to have better plots and interesting information on the PCA :
# loading factoextra library
library(factoextra)
require(factoextra)
# Calculating pca using prcomp()
res.pca <- prcomp(iris[,-5], scale = TRUE)
# Graph of individuals
fviz_pca_ind(res.pca,
col.ind = "contrib", # Color by their contribution to axes
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
# Graph of variables
fviz_pca_var(res.pca,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
# Biplot of individuals and variables
fviz_pca_biplot(res.pca, repel = TRUE,
col.var = color_palette[3],
col.ind = "#696969"
)
#Screeplot
screeplot(res.pca, type = "l")
var <- get_pca_var(res.pca)
# Variables contribution
head(var$contrib)
Dim.1 Dim.2 Dim.3 Dim.4
sepal_length 27.287211 13.86209637 51.986524 6.864169
sepal_width 6.935581 85.66548239 5.857991 1.540945
petal_length 33.785622 0.04449896 1.985063 64.184816
petal_width 31.991586 0.42792228 40.170422 27.410070
#For instance for the first axis
fviz_contrib(res.pca,
choice = c("var"),
axes = 1,
fill = color_palette[3],
color = color_palette[3],
sort.val = c("desc", "asc", "none"),
top = Inf,
xtickslab.rt = 45,
ggtheme = theme_minimal())
Step-by-step PCA
This is pretty satisfying to have package that can do such a great job. But to fully understand PCA, let us built it from scratch. First step, we split the iris dataset into data X and class labels y.
x <- iris[,-5]
y <- iris[,5]
Standardizing
This is not mandatory but Standardizing our data often makes sense. We scale the 4 features and store the scaled matrix into a new one (named X_scaled
).
X_scaled = scale(x)
Covariance Matrix
The classic approach to PCA is to perform the eigendecomposition on the covariance matrix \(Σ\), which is a \(p*p\) matrix where each element represents the covariance between two features. First we need to compute the mean vector.
#x_mean = (1/nrow(X_scaled))*t(X_scaled)%*%as.matrix(rep(1, dim(X_scaled)[1])) other mean to do this
x_mean = apply(X_scaled, 2, mean)
Let us compute the Covariance Matrix of the scaled features.
covariance_matrix = (1/(nrow(X_scaled)-1))*(t(X_scaled-x_mean)%*%(X_scaled-x_mean))
covariance_matrix
sepal_length sepal_width petal_length petal_width
sepal_length 1.0000000 -0.1093692 0.8717542 0.8179536
sepal_width -0.1093692 1.0000000 -0.4205161 -0.3565441
petal_length 0.8717542 -0.4205161 1.0000000 0.9627571
petal_width 0.8179536 -0.3565441 0.9627571 1.0000000
Then we perform an eigendecomposition on the covariance matrix and we compute the Eigenvectors and the Eigenvalues (using eigen()
function).
cov_decomp = eigen(covariance_matrix)
cov_decomp
eigen() decomposition
$values
[1] 2.91081808 0.92122093 0.14735328 0.02060771
$vectors
[,1] [,2] [,3] [,4]
[1,] 0.5223716 -0.37231836 0.7210168 0.2619956
[2,] -0.2633549 -0.92555649 -0.2420329 -0.1241348
[3,] 0.5812540 -0.02109478 -0.1408923 -0.8011543
[4,] 0.5656110 -0.06541577 -0.6338014 0.5235463
Principal component analysis using the standardized data is equivalent to principal component analysis using the correlation matrix. We obtain the eigenvalues, which are the variances of the features in the new subspace. \[\textbf{var}(Y_i) = \text{var}(a_{i1}X_1 + a_{i2}X_2 + \dots a_{ip}X_p) = \lambda_i\] And the eigenvectors, which point into the direction of the larger spread of data (the spread and the orientation).
Correlation Matrix
D <- diag(diag(covariance_matrix)^(-1/2))
corr_matrix <- D%*%covariance_matrix%*%D # correlation matrix
corr_matrix
[,1] [,2] [,3] [,4]
[1,] 1.0000000 -0.1093692 0.8717542 0.8179536
[2,] -0.1093692 1.0000000 -0.4205161 -0.3565441
[3,] 0.8717542 -0.4205161 1.0000000 0.9627571
[4,] 0.8179536 -0.3565441 0.9627571 1.0000000
We perform an eigendecomposition of the standardized data based on the correlation matrix.
eigen(corr_matrix)
eigen() decomposition
$values
[1] 2.91081808 0.92122093 0.14735328 0.02060771
$vectors
[,1] [,2] [,3] [,4]
[1,] 0.5223716 -0.37231836 0.7210168 0.2619956
[2,] -0.2633549 -0.92555649 -0.2420329 -0.1241348
[3,] 0.5812540 -0.02109478 -0.1408923 -0.8011543
[4,] 0.5656110 -0.06541577 -0.6338014 0.5235463
We perform an eigendecomposition of the raw data based on the correlation matrix and compare the obtained results with the previous steps.
X_raw_cormat <- cor(x)
eigen(X_raw_cormat)
eigen() decomposition
$values
[1] 2.91081808 0.92122093 0.14735328 0.02060771
$vectors
[,1] [,2] [,3] [,4]
[1,] 0.5223716 -0.37231836 0.7210168 0.2619956
[2,] -0.2633549 -0.92555649 -0.2420329 -0.1241348
[3,] 0.5812540 -0.02109478 -0.1408923 -0.8011543
[4,] 0.5656110 -0.06541577 -0.6338014 0.5235463
We obviously observe that all three approaches yield the same eigenvectors and eigenvalue pairs.
Explained Variance
We calculate the individual explained variation and the cumulative explained variation of each principal component.
We remember that \(\frac{\lambda_i}{\lambda_1 + \lambda_2 + \dots + \lambda_p}\) is the proportion of variation explained by the ith principal component.
first_variation = cov_decomp[['values']][1]/sum(cov_decomp[['values']])
first_variation
[1] 0.7277045
second_variation = cov_decomp[['values']][2]/sum(cov_decomp[['values']])
second_variation
[1] 0.2303052
third_variation = cov_decomp[['values']][3]/sum(cov_decomp[['values']])
third_variation
[1] 0.03683832
fourth_variation = cov_decomp[['values']][4]/sum(cov_decomp[['values']])
fourth_variation
[1] 0.005151927
cumlative_one = first_variation
cumlative_one
[1] 0.7277045
cumlative_two = cumlative_one + second_variation
cumlative_two
[1] 0.9580098
cumlative_three = cumlative_two + third_variation
cumlative_three
[1] 0.9948481
cumlative_four = cumlative_three + fourth_variation
cumlative_four
[1] 1
plot(seq(1,4),c(cumlative_one,cumlative_two,cumlative_three,cumlative_four), type='l',col=color_palette[3], xlab = 'Principal Component', ylab = 'Explained variation', main = 'Cumulative explained variation')
To plot the individual explained variation. (scree plot)
plot(seq(1,4), c(first_variation,second_variation,third_variation,fourth_variation), main = "Individual explained variation", xlab = 'Principal Component', ylab = 'Proportion of variance explained',xlim = c(1,4), ylim = c(0,1), type = "l", col=color_palette[3])
Projection Matrix
We construct as follow the projection matrix that will be used to transform the Iris data onto the new feature subspace.
projection_matrix = matrix(c(cov_decomp[['vectors']][,1],cov_decomp[['vectors']][,2]),ncol = 2)
Projection Onto the New Feature Space
We compute Y (Recall the Y is the matrix of scores, A is the matrix of loadings).
Y = matrix(as.double(unlist(x)),ncol=4)%*%projection_matrix
Visualization
And we plot the observations on the new feature space.
plot(Y[,1],Y[,2], xlab = 'PC1', ylab='PC2')
On the same plot, we color the observations (the flowers) with respect to their flower classes.
plot(Y[,1],Y[,2], xlab = 'PC1',ylab='PC2', main = "New Feature Space")
points(Y[,1],Y[,2],pch=20, cex = 1.5, col=ifelse(y=='setosa', color_palette[1],ifelse(y=='virginica',color_palette[2],color_palette[3])))
And well done !