Principal Component Analysis Using RStudio
Principal Component Analysis
PCA helps in interpreting complex data by reducing many traits/variables to a few principal components (PCs). Once the important PC axes are plotted against each other, the patterns in the data become easier to understand. From a plant breeder's point of view, PCA is used to assess the variability among genotypes, the trait-genotype interaction, and the interaction among the traits considered in the study.
Let's see how to draw PCA biplots in RStudio, along with the commands that are useful for the purpose.
1. Sample Data
- Enter the data as shown above. I have used sample seedling data taken from maize.
- PCA requires unreplicated data, that is, one row of values per genotype.
- Save the file in Excel format.
- The procedure for importing data was covered in the previous blog post about RStudio. Please refer to it if you have any doubts.
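If your data are replicated, one possible way to bring them into the unreplicated form is to average over the replications first. The sketch below is only an illustration: the data frame "replicated", its column names, and all of its values are made up, and it uses base R's aggregate() function.

```r
# Hypothetical replicated data: 3 genotypes x 2 replications, 2 traits.
# All names and values are made up purely for illustration.
replicated <- data.frame(
  genotype     = rep(c("G1", "G2", "G3"), each = 2),
  replication  = rep(1:2, times = 3),
  shoot_length = c(10, 12, 15, 17, 20, 22),
  root_length  = c(5, 7, 8, 10, 11, 13)
)

# Average over replications so that each genotype has exactly one row
genotype_means <- aggregate(
  cbind(shoot_length, root_length) ~ genotype,
  data = replicated,
  FUN  = mean
)

print(genotype_means)
```

The resulting table has one row per genotype and can be used in the same way as the sample data.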
3. R Commands for PCA biplot
3.1. Install the package factoextra:
install.packages("factoextra")
Note*:
- We can either run the above command or go directly to the Install option in RStudio and search for the package. Once it is selected from the list, RStudio installs it automatically.
- "factoextra" depends on the package "ggplot2". If it is not already installed, install it first to avoid errors.
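To check which of the two packages are already present before installing anything, a small sketch like the following may help. It only reports what is missing; the install line is left commented out so that nothing is downloaded unless you choose to. The variable names are my own.

```r
# Packages this tutorial relies on
needed <- c("ggplot2", "factoextra")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All required packages are installed.")
}
```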
3.2. Load the package:
library(factoextra)
3.3. Specifying the columns in data required for PCA:
data<-sample[c(2:9)]
Note*:
- "sample" is the name under which the Excel file was imported. From it, we take only the values in columns two to nine, thus excluding the genotype column.
- The selected columns are saved under the name "data" for convenience.
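Since the genotype column is dropped here, it can be handy to carry the genotype names over as row names, so that the plots label each point with its genotype. A minimal sketch, assuming a small stand-in table: "sample_df", its column names, and its values are hypothetical and not the maize data.

```r
# Stand-in for the imported sheet: a label column followed by trait columns.
# Column names and values are hypothetical.
sample_df <- data.frame(
  genotype = c("G1", "G2", "G3", "G4"),
  trait_a  = c(4.1, 5.3, 6.2, 5.0),
  trait_b  = c(10.5, 12.1, 9.8, 11.4),
  trait_c  = c(0.8, 1.1, 0.9, 1.3)
)

# as.data.frame() matters if the sheet was imported as a tibble,
# because tibbles do not keep row names
sample_df <- as.data.frame(sample_df)

# Keep only the trait columns (here columns 2 to 4)
trait_data <- sample_df[, 2:4]

# Carry the genotype labels over as row names
rownames(trait_data) <- sample_df$genotype

# PCA needs numeric columns only, so this check should give TRUE
all(sapply(trait_data, is.numeric))
```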
3.4. Computing PCA:
prcomp(data, scale. = TRUE)
pca<-prcomp(data, scale. = TRUE)
Note*:
- The "prcomp" function performs the PCA. Printing its result shows the standard deviations of the principal components and the rotation (loadings) matrix for the selected traits. We can see the values in the R console region as shown in the above figure. Setting "scale. = TRUE" standardizes each trait to unit variance before the analysis.
- The PCA result is saved under the name "pca" for convenience in further commands.
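To get a feel for what the object returned by "prcomp" contains, here is a short sketch. It uses the built-in USArrests data set purely as a stand-in for the trait table, and the name "pca_demo" is my own.

```r
# USArrests ships with base R; it stands in for the trait table here
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Components stored in the result
names(pca_demo)

# Scores: coordinates of each row (genotype) on the PC axes
head(pca_demo$x)

# Loadings: weight of each column (trait) in each PC
pca_demo$rotation

# With scale. = TRUE, the PC variances equal the eigenvalues
# of the correlation matrix of the data
eigenvalues <- eigen(cor(USArrests))$values
all.equal(pca_demo$sdev^2, eigenvalues)
```

The last line is a quick way to convince yourself that PCA on scaled data is the same as PCA on the correlation matrix.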
3.5. Summary Statistics of PCA:
summary(pca)
Note*:
- Summary statistics include the standard deviation, the proportion of variance, and the cumulative proportion for each principal component.
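The proportions in the summary can also be worked out by hand from the standard deviations, which may help in understanding where they come from. A minimal sketch, again with the built-in USArrests data as a stand-in and variable names of my own:

```r
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Variance of each PC is the square of its standard deviation
variances <- pca_demo$sdev^2

# Share of the total variance explained by each PC, and the running total
prop_var <- variances / sum(variances)
cum_prop <- cumsum(prop_var)

round(rbind(prop_var, cum_prop), 4)

# summary() stores the same numbers (rounded) in its "importance" matrix
summary(pca_demo)$importance
```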
3.6. Scree plot of the variance explained by each principal component:
fviz_eig(pca)
3.7. Graph representing genotypes (similar genotypes are grouped together):
fviz_pca_ind(pca,col.ind = "cos2", gradient.cols = c("green", "yellow", "red"), repel = TRUE)
Note*:
- The "cos2" option colours the genotypes according to the quality of representation of each genotype on the PC axes.
- The "repel" option prevents text overlapping in the graph as much as possible.
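For the curious, the quality of representation (cos2) can be sketched by hand: it is the squared coordinate of a point on a PC axis divided by its squared distance from the origin. The example below uses the built-in USArrests data as a stand-in; as far as I understand, the colour in the plot corresponds to the sum of cos2 over the two plotted axes, but please treat that as an assumption.

```r
pca_demo <- prcomp(USArrests, scale. = TRUE)
scores <- pca_demo$x

# cos2: squared coordinate on each PC divided by the squared distance
# of the point from the origin (sum of squared coordinates over all PCs)
cos2 <- scores^2 / rowSums(scores^2)

# Over all PCs, the cos2 values of each row add up to 1
head(rowSums(cos2))

# Quality of representation on the PC1-PC2 plane
quality_12 <- rowSums(cos2[, 1:2])
head(sort(quality_12, decreasing = TRUE))
```

A value close to 1 means the point is well represented by the first two axes; a low value means its position in the plot should be read with caution.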
3.8. Graph representing traits (Useful for assessing correlation):
fviz_pca_var(pca, col.var = "contrib", gradient.cols = c("pink", "purple", "blue"), repel = TRUE)
Note*:
- The "contrib" option colours the traits/characters according to the contribution of each trait towards the PC axes.
3.9. Biplot of both genotypes and traits:
fviz_pca_biplot(pca, repel = TRUE, col.var = "red", col.ind = "blue")
4. R commands for accessing PCA values that were used in plotting these graphs:
4.1. Eigenvalues:
get_eigenvalue(pca)
4.2. Results for traits:
get_pca_var(pca)
res.var<-get_pca_var(pca)
Note: Printing the result of the first command lists the components that can be used to view specific results. I have saved the result under the shorter name "res.var" for convenience.
res.var$coord
res.var$cor
res.var$cos2
res.var$contrib
Note:
- "res.var$coord" gives the PCA coordinates of the traits.
- "res.var$cor" gives the correlation between the traits and the PC axes.
- "res.var$cos2" gives the quality of representation of the traits.
- "res.var$contrib" gives the contribution of each trait towards the PC axes.
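These four tables are closely related, and a rough base R sketch can show how. It assumes a PCA on scaled data (scale. = TRUE), uses the built-in USArrests data as a stand-in, and the variable names are my own; it is meant as an illustration of the idea rather than a copy of what factoextra does internally.

```r
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Coordinates of the traits: loadings multiplied by the PC standard deviations
var_coord <- sweep(pca_demo$rotation, 2, pca_demo$sdev, "*")

# For scaled data these coordinates are the trait-PC correlations
var_cor <- cor(USArrests, pca_demo$x)
all.equal(var_coord, var_cor)

# Quality of representation: squared coordinates
var_cos2 <- var_coord^2

# Contribution (%) of each trait to each PC: its share of the column total
var_contrib <- 100 * sweep(var_cos2, 2, colSums(var_cos2), "/")
round(var_contrib, 2)
```

Each column of the contribution table adds up to 100, so a trait with a value well above the average share is an important one for that axis.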
4.3. Results for genotypes:
get_pca_ind(pca)
res.ind<-get_pca_ind(pca)
Note: Printing the result of the first command lists the components that can be used to view specific results for the genotypes. I have saved the result under the shorter name "res.ind" for convenience.
res.ind$coord
res.ind$cos2
res.ind$contrib
Note:
- "res.ind$coord" gives the PCA coordinates of the genotypes.
- "res.ind$cos2" gives the quality of representation of the genotypes.
- "res.ind$contrib" gives the contribution of each genotype towards the PC axes.
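The idea behind the contribution of a genotype can be sketched in base R as its percentage share of the total squared scores on a PC axis. The example uses the built-in USArrests data as a stand-in, and the names are my own. Note that the values reported by factoextra may differ from this sketch by a small constant factor because of how the variance is normalised, so treat it as an approximation of the idea.

```r
pca_demo <- prcomp(USArrests, scale. = TRUE)
scores <- pca_demo$x

# Share (in %) of each row in the total squared score of each PC
contrib <- 100 * sweep(scores^2, 2, colSums(scores^2), "/")

# Each PC column adds up to 100
colSums(contrib)

# Rows contributing more than the average share to PC1
average_share <- 100 / nrow(scores)
names(which(contrib[, "PC1"] > average_share))
```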