Principal component analysis (PCA) is a multivariate statistical technique for dimension reduction. In practice it is used when a dataset contains many correlated variables, in order to reduce their number while losing as little information as possible.
The aim of PCA is precisely to maximize variance: it calculates the weight to be attributed to each starting variable so that the variables can be concentrated into one or more new variables (called principal components), each of which is a linear combination of the starting variables.
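As a minimal illustration of this idea (a sketch in base R, using the built-in mtcars data and four of its correlated variables as an example), the weights of the first principal component are the leading eigenvector of the correlation matrix:

```r
# Sketch: PC1 as the variance-maximizing linear combination (base R, mtcars data)
X <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])  # standardize four correlated variables

e <- eigen(cor(X))      # eigen-decomposition of the correlation matrix
w <- e$vectors[, 1]     # weights attributed to each starting variable for PC1
pc1 <- X %*% w          # scores: one new variable combining the original four

# The variance of PC1 equals the largest eigenvalue e$values[1]: no other
# unit-length set of weights gives a linear combination with higher variance.
var(as.vector(pc1))
```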
- Variables must be quantitative
- There must be a linear correlation between variables
The first thing to do when running a PCA is to calculate the variance/covariance matrix or the Pearson correlation matrix. PCA is in fact a technique that can be used when the assumptions of the Pearson linear correlation coefficient are met. Pearson's correlation coefficient describes the direction and intensity of the linear relationship between two phenomena: the closer the coefficient is to zero, the weaker the relationship; the closer it gets to -1 or +1, the stronger the relationship. In PCA, acceptable values for this indicator are R > 0.3 or R < -0.3. If a variable has correlation coefficients very close to 0 with all the other variables, it should not be included in the PCA: forcing it to merge with the others would cause a very high loss of information, which is generally better avoided.
- Lack of outliers
- Quite large sample size
- Sample adequacy
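The correlation screening described above can be sketched in base R (mtcars is used here purely as an example dataset):

```r
# Sketch: screening variables by their pairwise Pearson correlations
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt")]
R <- cor(X)                          # Pearson correlation matrix
round(R, 2)

# For each variable, find its strongest correlation with any other variable;
# a variable whose maximum |r| stays below 0.3 is a poor candidate for PCA.
diag(R) <- NA                        # ignore self-correlations
apply(abs(R), 1, max, na.rm = TRUE)
```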
After verifying the dataset requirements, i.e. checking that the variables have the right characteristics for a principal component analysis, these are the steps to conduct it:
Check the adequacy of the sample through:
- The Kaiser-Meyer-Olkin (KMO) test, which establishes whether the variables considered are actually consistent for the use of a principal component analysis. This index can take values between 0 and 1 and, for a principal component analysis to make sense, it must have a value greater than at least 0.5.
This index can be calculated as a whole for all the variables included in the PCA.
- Bartlett's sphericity test: a hypothesis test whose null hypothesis is that the correlation matrix coincides with the identity matrix. If that were the case, performing a PCA would make no sense, as it would mean the variables are not linearly related to each other at all. As with all hypothesis tests, the decision whether to reject the null hypothesis is based on the p-value. Here, for the model to be considered valid, the p-value must be lower than 0.05: in that case the null hypothesis can be rejected at the 5% significance level.
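The two adequacy checks can be sketched as follows. This is a from-scratch base-R version of Bartlett's statistic, assuming complete numeric data; the psych package provides ready-made cortest.bartlett() and KMO() functions:

```r
# Sketch: Bartlett's sphericity test computed from its chi-square formula
bartlett_sphericity <- function(X) {
  n <- nrow(X); p <- ncol(X)
  R <- cor(X)
  # Test statistic: -[(n - 1) - (2p + 5)/6] * ln(det(R))
  chi2 <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  p_value <- pchisq(chi2, df, lower.tail = FALSE)
  list(chi2 = chi2, df = df, p_value = p_value)
}

bartlett_sphericity(mtcars[, c("mpg", "disp", "hp", "wt")])
# A p-value below 0.05 rejects the identity-matrix hypothesis: PCA makes sense.
```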
The crucial part of PCA is to establish the adequate number of factors that can best represent the starting variables.
To better understand this concept, imagine that your dataset is a city you don't know, and each principal component is a street in this city. If you wanted to get to know this city, how many streets would you visit? You would probably start from the central street (the first principal component) and then explore other streets. How many though?
In order to say that you know a city well enough, the number of streets to visit varies according to the size of the city and how similar or different the streets are, obviously. Similarly, the number of components to extract depends on how many variables you choose to include in your principal component analysis and how similar they are to each other. In fact, the more correlated they are, the lower the number of principal components necessary to obtain a good knowledge of the starting variables. Conversely, the less they are correlated, the greater the number of principal components to be extracted in order to have accurate information about the dataset.
The criteria used for choosing the number of components are essentially three: eigenvalues greater than 1, the proportion of explained variance, and the scree plot (parallel analysis is a further common alternative).
- Eigenvalues greater than 1
Following this criterion (also known as the Kaiser criterion), only the components whose eigenvalue is greater than 1 are retained: a component with an eigenvalue below 1 explains less variance than a single standardized starting variable, so extracting it would not be worthwhile.
- Proportion of Explained Variance
Following this criterion, the components to be extracted must ensure that at least 70% of the overall variability of the starting variables is preserved. Furthermore, each single component to be extracted should bring a significant increase in the overall explained variance (for example, at least 5% or 10% more).
- Scree Plot
As you can see from the graph, the components are listed on the x axis, whereas the eigenvalues are on the y axis. Where the curve makes an "elbow" is where to draw a line: only the components above the elbow are taken into consideration.
From the graph above, for example, you can see that the number of points above the elbow is 2.
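The three criteria can be applied together in a few lines of base R (a sketch using six mtcars variables as an example):

```r
# Sketch: choosing the number of components with the three criteria
X <- scale(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
ev <- eigen(cor(X))$values        # one eigenvalue per component

prop <- ev / sum(ev) * 100        # % of variance explained by each component
cum  <- cumsum(prop)              # cumulative % of explained variance

sum(ev > 1)                       # Kaiser criterion: how many eigenvalues exceed 1
which(cum >= 70)[1]               # smallest number of components keeping >= 70%

plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue")  # scree plot
abline(h = 1, lty = 2)            # reference line for the Kaiser criterion
```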
The final part of PCA consists in giving a name to each of the principal components found.
With statistical software (such as SPSS, Jamovi and R), PCA is a very simple operation: a few clicks are enough to obtain an output to be interpreted. No software is therefore preferable to the others, as PCA is a widely used technique and all statistical programs allow it to be performed easily, without any hand calculation. In this module, however, we will show how to conduct PCA with the R software.
The whole process to implement PCA in R is presented in the PowerPoint attached to this module. Here we will just present the FactoMineR package.
FactoMineR can carry out principal component analysis by reducing the dimensionality of the multivariate data to two or three dimensions, which can thus be displayed graphically with a minimum loss of information. This can be done using a single command, PCA, inserting the matrix object of analysis between parentheses:
library(FactoMineR)
X <- as.matrix(DATASET)
res.pca <- PCA(X)
With the summary command, summary(res.pca), we can see the importance of the components in terms of standard deviation, proportion of explained variance and cumulative explained variance, both for individuals and for variables.
With the head command, head(res.pca$eig), you can instead inspect the eigenvalues: for each component it returns the eigenvalue, the percentage of explained variance and the cumulative explained variance.
Example of what we will see on R
Finally, in order to draw the scree-plot of the eigenvalues, we insert the object of analysis between parentheses:
barplot(res.pca$eig[,1], main="Eigenvalues' scree-plot")
With the main argument we indicate the title of the graph.
Example of what we will see on R
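Putting the commands above together, a complete session might look like this (a sketch assuming FactoMineR is installed; decathlon is an example dataset shipped with the package, whose first ten columns are quantitative event results):

```r
library(FactoMineR)

data(decathlon)                    # athletes' results in ten decathlon events
X <- as.matrix(decathlon[, 1:10])  # keep only the quantitative variables

res.pca <- PCA(X, graph = FALSE)   # run the PCA (graph = FALSE suppresses the default plots)

summary(res.pca)                   # importance of the components, individuals and variables
head(res.pca$eig)                  # eigenvalues, % of variance, cumulative % of variance

barplot(res.pca$eig[, 1], main = "Eigenvalues' scree-plot")
```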
Another useful package for PCA (we won't cover it in this module though) is factoextra, which provides some easy-to-use functions to extract and visualize the results we get from multivariate analyses, including PCA (principal component analysis), CA (simple correspondence analysis), MCA (multiple correspondence analysis), MFA (multiple factor analysis), HMFA (hierarchical multiple factor analysis).
Keywords: PCA, correlation, quantitative variables, explained variance, eigenvalues.
Objectives/goals:
This module aims to introduce and explain the Principal Component Analysis technique.
By the end of this module you will be able to:
- Know the logic of PCA;
- Know the requirements of PCA;
- Conduct a PCA;
- Conduct a PCA in R with the FactoMineR package.
In this training module, the multidimensional analysis technique called Principal Component Analysis (PCA) will be presented. Its objective is to reduce the dimensionality of a phenomenon under investigation while preserving the information contained in it. The technique applies to phenomena measured with quantitative variables, which distinguishes it from other dimensionality reduction techniques, such as simple correspondence analysis (CA) and multiple correspondence analysis (MCA), developed for the analysis of qualitative variables.
The last part of the module will be dedicated to the application of PCA with R.