23 March 2018

What is PCA – Principal Component Analysis and how to apply it to a dataset

Most of the time, the dataset we want to analyse is characterized by a huge number of attributes. The very high dimensional nature of many datasets makes graphical visualization impossible because, as humans, we are only able to analyse data in two- and three-dimensional space. In this scenario, reducing the dimensions of the data without losing too much information can be very helpful when analysing datasets. By reducing N dimensions of data to 2 or 3 dimensions, we can obtain a graphical representation of the data, which makes it easier to understand and explore, and to find patterns in it. This process of converting a dataset with a vast number of dimensions into one with fewer dimensions, while ensuring that it still conveys similar information, is called Dimension Reduction. Next, we will address what the Principal Component Analysis (PCA) is.

What is the Principal Component Analysis (PCA)?

One of the most commonly used techniques for reducing the dimensionality of data is a statistical method called Principal Component Analysis (PCA). The purpose of PCA is to identify the dimensions that explain the greatest amount of the variance in the data. In this way, we can find which dimensions best differentiate the dataset under analysis; in other words, PCA finds the principal components of the dataset.

With this technique it is possible to highlight the similarities and differences that exist in the data when identifying patterns. Finding patterns in high-dimensional data can be a very hard process, because a graphical representation is not available, so a visual analysis of the data is not an option. Once those patterns are identified in the dataset, the number of dimensions to analyse can be reduced without a significant loss of information, because our focus will be the analysis of the main dimensions that characterize the dataset.

How to apply PCA to a dataset?

Let us suppose that we have a dataset with 2 dimensions (X and Y), as we can see in the picture below.

Figure 1 – Initial dataset.

Graphically, this data can be represented as follows:


Figure 2 – Graphical representation of dataset with 2 dimensions.

To apply the PCA to this dataset, you have to perform the following steps:

1st) Subtract the Mean

The first step when performing PCA consists in normalizing the data, i.e., subtracting the mean of each dimension from the values in that dimension, in order to produce a new dataset whose mean is 0. For this example, we have to compute X − Mean(X) and Y − Mean(Y) to obtain the new dataset.

Mean:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Figure 3 – Mean, data normalization and graphical representation of normalized data.
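As a minimal sketch of this step in Python, assuming a small 2-dimensional dataset with illustrative values (the actual values of Figure 1 are not reproduced here):

```python
import numpy as np

# Illustrative 2-dimensional dataset (demonstration values only,
# not the dataset shown in Figure 1).
data = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [1.9, 2.2],
    [3.1, 3.0],
    [2.3, 2.7],
])

# Step 1: subtract the mean of each dimension (column), so that the
# resulting dataset has mean 0 in both X and Y.
mean = data.mean(axis=0)
normalized = data - mean

print(mean)        # per-dimension means that were subtracted
print(normalized)  # zero-mean dataset used in the next steps
```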

2nd) Calculate the Covariance Matrix

Variance and covariance are two measures used in statistics. Variance tells us how the data is spread out from the mean, "ignoring" the existence of other dimensions, so it only operates on 1 dimension. Covariance, in its turn, is always measured between 2 dimensions, allowing us to find out how much the dimensions vary from the mean with respect to each other. With the covariance value between two dimensions we can deduce how related they are to each other, by carrying out the following analysis:

• cov(X,Y) > 0: if the covariance value between X and Y is positive, we can deduce that X and Y are related to each other, which means that when the X values increase, the Y values also increase.
• cov(X,Y) < 0: if the covariance value between X and Y is negative, we can deduce that X and Y are also related to each other, which means that when the X values increase, the Y values decrease.
• cov(X,Y) = 0: if the covariance value between X and Y is zero, we can deduce that the two dimensions are uncorrelated, so the behaviour of dimension X does not tell us, linearly, how dimension Y behaves.

Variance:

$$\operatorname{var}(X) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}$$

Covariance:

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
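As a minimal sketch of these two formulas, assuming X and Y are NumPy arrays with illustrative values (not the dataset of Figure 1); note the division by n − 1, the sample estimator, which is also what np.var(..., ddof=1) and np.cov use:

```python
import numpy as np

def variance(x):
    """Sample variance: how a single dimension spreads around its mean."""
    return ((x - x.mean()) ** 2).sum() / (len(x) - 1)

def covariance(x, y):
    """Sample covariance: how two dimensions vary around their means together."""
    return ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Illustrative values only.
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0])

print(variance(x), np.var(x, ddof=1))        # same result
print(covariance(x, y), np.cov(x, y)[0, 1])  # same result
```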

In order to understand how all the dimensions of the dataset are related to each other, we have to calculate the covariance values between all of them. A useful way to get all the possible covariance values between the different dimensions is to calculate them and arrange them in a matrix. For this particular example, as we have a 2-dimensional dataset, the covariance matrix will have 2 rows and 2 columns (a 2x2 matrix) and is represented as follows:

$$C = \begin{pmatrix} \operatorname{cov}(X,X) & \operatorname{cov}(X,Y) \\ \operatorname{cov}(Y,X) & \operatorname{cov}(Y,Y) \end{pmatrix}$$

Note that cov(X,Y) = cov(Y,X) and cov(X,X) = var(X).

With the normalized data from step 1, we build the following matrix:

Figure 4 – Matrix with normalized dataset.

From the new matrix, we calculate the covariance matrix:

Figure 5 – Covariance matrix with normalized data

It is important to notice that cov(X,Y) < 0, so we should expect that when the X values increase, the Y values decrease.
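Continuing the sketch from step 1 (so `normalized` is the hypothetical zero-mean dataset defined there), the covariance matrix can be obtained directly with NumPy; rowvar=False indicates that each column is a dimension:

```python
import numpy as np

# normalized: the zero-mean dataset from step 1, one column per dimension.
cov_matrix = np.cov(normalized, rowvar=False)

# For a 2-dimensional dataset this is a symmetric 2x2 matrix:
# [[cov(X, X), cov(X, Y)],
#  [cov(Y, X), cov(Y, Y)]]
print(cov_matrix)
```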

3rd) Calculate the eigenvectors and eigenvalues of the covariance matrix

The eigenvectors and eigenvalues capture the main characteristics of a matrix. Regarding the eigenvectors, we can note the following properties:

• They can only be found for square matrices (NxN).
• An NxN symmetric matrix, such as the covariance matrix, has N eigenvectors.
• The eigenvectors of a symmetric matrix are all orthogonal to each other.
• They are usually normalized so that their length is exactly one.

The first step to get the eigenvectors of a matrix is to calculate its eigenvalues. The eigenvalues are scalar values that can be obtained by solving the following equation:

$$\det(A - \lambda I) = 0$$

Where A is the matrix for which we want to find the eigenvectors, I is the identity matrix and λ the eigenvalue(s) to be found.

How to calculate the determinant of a matrix?

For an NxN matrix, the determinant can be computed by expanding along the first row (the Laplace expansion):

$$\det(A) = \sum_{j=1}^{N} (-1)^{1+j}\, a_{1j}\, \det(A_{1j})$$

where $a_{1j}$ is the entry in row 1 and column j, and $A_{1j}$ is the matrix obtained by deleting row 1 and column j from A. For instance, if we have a 2x2 matrix

$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix},$$

then det(A) is given by:

$$\det(A) = ad - bc$$
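To see how this connects to the characteristic equation above, here is a worked example with made-up numbers (not the covariance matrix of this article's dataset). For the symmetric 2x2 matrix with entries 0.6, −0.4, −0.4, 0.7, the condition det(A − λI) = 0 becomes

$$(0.6 - \lambda)(0.7 - \lambda) - (-0.4)(-0.4) = \lambda^{2} - 1.3\,\lambda + 0.26 = 0,$$

whose two roots, approximately 1.05 and 0.25, are the eigenvalues of that matrix.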

Using the covariance matrix from step 2), we get the following eigenvalues:

Figure 6 – Eigenvalues of covariance matrix.

Having found the eigenvalues, we use them to obtain the corresponding eigenvectors, once again by solving the following equation:

$$(A - \lambda_i I)\, v_i = 0$$

where $v_i$ is the eigenvector associated with the eigenvalue $\lambda_i$.


Figure 7 – Eigenvectors of covariance matrix.
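Continuing the NumPy sketch, the eigenvalues and eigenvectors of the covariance matrix can be obtained in one call; np.linalg.eigh is used because the covariance matrix is symmetric, and it returns unit-length eigenvectors as columns:

```python
import numpy as np

# cov_matrix: the symmetric covariance matrix from step 2.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

print(eigenvalues)   # one eigenvalue per dimension
print(eigenvectors)  # column i is the eigenvector for eigenvalues[i]
```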

The eigenvectors of the covariance matrix provide us with information about the patterns in the data. By plotting these vectors over the normalized data, it is easier to see this information.


Figure 8 – Projection of normalized data, oriented by its eigenvectors.

When analysing this plot, we can see that the eigenvector X2 goes through the middle of the points, like a line of best fit, showing how the data points are related along that line. The second eigenvector, X1, gives us another pattern in the data, showing that all the points follow the main line (represented by vector X2) but are off to the side of it by some amount. From this analysis, we can conclude that the eigenvector X2 is the main component, because this vector characterises the data better than the eigenvector X1.

4th) Sort the eigenvectors in descending order of their eigenvalues

Once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives us the components in order of significance, in other words, we get the main components of the dataset.

In the previous step, we found that the eigenvector which best characterises the dataset of the example is the eigenvector X2. In fact, the eigenvalue of X2 (lambda 2) is greater than the eigenvalue of X1 (lambda 1) and, for this reason, X2 is the main component of the dataset.

Figure 9 – Eigenvalues and eigenvectors sorted in descending order.
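In the NumPy sketch, this ordering can be obtained with np.argsort (with np.linalg.eigh the eigenvalues come out in ascending order, so the order simply has to be reversed):

```python
import numpy as np

# Indices that sort the eigenvalues from highest to lowest.
order = np.argsort(eigenvalues)[::-1]

sorted_eigenvalues = eigenvalues[order]
sorted_eigenvectors = eigenvectors[:, order]  # reorder the columns accordingly
```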

5th) Choose p (p < N) main components from a dataset with N dimensions

Knowing the most important dimensions of the dataset, the less significant ones can be ignored without losing too much information, provided their eigenvalues are small. If we choose to leave out the less important components, the final dataset will have fewer dimensions than the original one.

This means that, if we originally have N dimensions in our data, after calculating the N eigenvectors and eigenvalues and sorting them, we choose only the first p components and obtain a new dataset with only p dimensions that represents the original one. Therefore, we create a matrix formed by those p eigenvectors, called the Feature Vector (F).
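A sketch of this selection, assuming sorted_eigenvectors from the previous step; F keeps the first p columns, i.e. the p most significant eigenvectors:

```python
# Keep the p most significant eigenvectors as the columns of the Feature Vector F.
p = 1  # e.g. keep only the main component; p = 2 keeps everything in this example
F = sorted_eigenvectors[:, :p]
```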

6th) Transform the normalized dataset into a new dataset

Let F be the matrix of eigenvectors, sorted from the most important eigenvector (leftmost column) to the least important, and M the matrix that represents the normalized dataset:

Figure 10 – Feature Vector.

Figure 11 – Normalized dataset M.

We can use the Feature Vector F to transform the normalized data M so that it is represented in a new coordinate system, given by the eigenvectors in F. This data transformation is done by multiplying the transpose of F by the transpose of M:

$$R = F^{T} \times M^{T}$$

The result, R, is the matrix that represents the original dataset oriented according to the new axes defined by the eigenvectors.

Figure 12 – Transformed dataset into the new coordinate system (R), given by eigenvectors.
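In the NumPy sketch, with `normalized` playing the role of M (one observation per row) and F holding the chosen eigenvectors as columns, the transformation is a single matrix product:

```python
# M has one observation per row, so M.T has one observation per column.
# Each column of R is that observation expressed in the coordinate
# system defined by the eigenvectors stored in F.
M = normalized
R = F.T @ M.T
```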

If we choose to keep two eigenvectors when transforming data, we will get the following plot:


Figure 13 – Dataset R, represented by 2 main components.

This plot is nothing more than the dataset transformed according to the axes defined by the eigenvectors, after both vectors have been rotated to coincide with the X and Y axes. Vector X2 was rotated by 315°, while vector X1 was rotated by 135° (clockwise rotation).

This data corresponds to the original data: since we decided to keep both eigenvectors, there was no loss of information, only a transformation of the data into a new coordinate system.

However, if we choose to transform the data keeping only the main component of this dataset and ignoring the second eigenvector, the data will be represented in a single dimension (Y = 0). In that case, we would have the following plot:


Figure 14 – Dataset R, represented by the main component.

How to get the original dataset back?

From the transformed data in the new coordinate system, R, it is possible to obtain the original dataset. If we decided to keep the 2 eigenvectors used in the data transformation, we get exactly the original dataset back. However, if we reduced the number of eigenvectors in the final transformation to one, then the retrieved data will have lost some information.

To transform the initial dataset into the new coordinate system, we used the following formula:

$$R = F^{T} \times M^{T}$$

In order to get the original dataset back, we rearrange the above formula into:

$$M^{T} = (F^{T})^{-1} \times R$$

Since F is a matrix of orthonormal eigenvectors, it is an orthogonal matrix, and therefore the inverse of $F^{T}$ is equal to its transpose, which is simply F itself. So, this formula can be rewritten as follows:

$$M^{T} = F \times R$$

The matrix M represents the normalized dataset, that is, the dataset from which the mean was subtracted. To get the actual original data back, we need to add back the mean of the original data, which was subtracted in step 1 (the mean vector is added to every column of $M^{T}$):

$$\text{OriginalData}^{T} = (F \times R) + \text{Mean}$$
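In the NumPy sketch, and assuming `mean` still holds the per-dimension means computed in step 1, the recovery is:

```python
# Recover the (transposed) normalized data, then add the mean back.
# If F contains all the eigenvectors the recovery is exact; with fewer
# eigenvectors some information is lost.
M_T_recovered = F @ R               # shape: (dimensions, observations)
recovered = M_T_recovered.T + mean  # one observation per row again
```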

Applying the formula used to recover the original dataset to our example, and using the matrix F with both eigenvectors, we obtain the following dataset:

Figure 15 – Initial dataset recovered from the transformed dataset.

This new dataset is exactly the same dataset we started with. As we can see, there was no loss of information when recovering the data, so the resulting plot is the same as the initial one.


Figure 16 – Graphical representation of dataset recovered from transformed data (left plot). Graphical representation of initial dataset (right plot).

However, if in the matrix F we keep only one eigenvector (the main component) and ignore the second one, then, when recovering the initial dataset, we notice a small loss of information, giving rise to a plot that is close to, but not exactly the same as, the initial one.

Figure 17 – Initial dataset recovered from transformed data (1 main component).


Figure 18 – Graphical representation of initial dataset recovered from transformed data (1 main component, left plot). Graphical representation of initial dataset (right plot).

In summary, with this analysis, we can show, for a particular example, how PCA is able to transform the data into a new coordinate system, ensuring that:

• The first coordinate (1st main component) corresponds to the axis with the largest variance, which means that, along this axis, the data is the most spread out.
• The second coordinate (2nd main component) corresponds to the axis with the second largest variance.

It also shows how, after applying PCA, we can understand which dimensions are the most relevant ones to characterise a given dataset. After analysing the results obtained from PCA, we could see that the least significant dimensions can be ignored without any significant loss of information. The graphical representation of the initial dataset recovered from the transformed data showed us that, by keeping only one eigenvector (the main component of this set), the recovered data remains very close to the original, with only a small loss of information.
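For completeness, here is a compact end-to-end sketch of the whole procedure in NumPy, under the same assumptions as the earlier snippets (one observation per row, illustrative data); scikit-learn's sklearn.decomposition.PCA performs equivalent steps and can be used to cross-check the result.

```python
import numpy as np

def pca(data, p):
    """Project `data` (one observation per row) onto its first p main components."""
    mean = data.mean(axis=0)
    normalized = data - mean                                # step 1: subtract the mean
    cov_matrix = np.cov(normalized, rowvar=False)           # step 2: covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)  # step 3: eigenvalues and eigenvectors
    order = np.argsort(eigenvalues)[::-1]                   # step 4: sort by eigenvalue, descending
    F = eigenvectors[:, order][:, :p]                       # step 5: Feature Vector with p components
    R = F.T @ normalized.T                                  # step 6: transform the data
    recovered = (F @ R).T + mean                            # recover an approximation of the data
    return R, recovered
```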