1. Main problem

Sometimes you have a huge number of variables. So, to make your data useful, you need to reduce the number of variables without losing the precious information.

2. Data

I will use a dataset from [Huttenlocher, Vasilyeva, Cymerman, Levine 2002]. The authors analysed 46 pairs of mothers and children (aged from 47 to 59 months, mean age 54). They recorded and transcribed 2 hours of speech from each child per day. During the study they compared the number of noun phrases per utterance in the mothers' speech to the number of noun phrases per utterance in the children's speech.
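A quick look at the data (a sketch; the same local file is read again later in this document, and the column names child and mother match the prcomp() output shown below):

df <- read.csv("../../data/Huttenlocher.csv")
str(df)   # expected: 46 rows, two numeric columns (child, mother)
head(df)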

3. PCA

PCA is essentially a rotation of the coordinate axes, chosen such that each successive axis captures as much variance as possible. Can we reduce two dimensions to one? Compare this with a regression:

We used regression to predict the value of one variable from another variable.

In PCA we change the coordinate system and then describe the variables' values using fewer variables.

So the blue line is the first Principal Component (and it is NOT a regression line). The number of PCs is always equal to the number of variables, so we can draw the second PC:
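To see that the first PC is not the regression line, here is a minimal sketch comparing the slope of the regression line with the direction of PC1 (it assumes the Huttenlocher data with the columns child and mother loaded above):

# slope of the ordinary regression line (mother predicted from child)
lm_slope <- coef(lm(mother ~ child, data = df))[2]

# slope of the first principal component in the same coordinates
rot <- prcomp(df)$rotation
pc1_slope <- rot["mother", "PC1"] / rot["child", "PC1"]

c(regression = unname(lm_slope), PC1 = pc1_slope)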

So the main point of PCA is that if the cumulative proportion of explained variance is high, we can drop some of the PCs. So, we need to know the following things:

summary(prcomp(df))
## Importance of components:
##                           PC1    PC2
## Standard deviation     0.2544 0.1316
## Proportion of Variance 0.7890 0.2110
## Cumulative Proportion  0.7890 1.0000

So, PC1 explains only 78.9% of the variance in our data.

df <- read.csv("../../data/Huttenlocher.csv")
prcomp(df)
## Standard deviations (1, .., p=2):
## [1] 0.2543899 0.1315688
## 
## Rotation (n x k) = (2 x 2):
##              PC1        PC2
## child  0.6724959 -0.7401009
## mother 0.7401009  0.6724959

So the formula for the first principal component is \[PC1 = 0.6724959 \times child + 0.7401009 \times mother\] The formula for the second principal component is \[PC2 = -0.7401009 \times child + 0.6724959 \times mother\]

Now we can change the axes:
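In code, changing the axes means centring the data and projecting it onto the rotation matrix; a minimal sketch (prcomp() already stores these scores in the $x component):

pca <- prcomp(df)

# project the centred data onto the new axes
scores <- scale(df, center = pca$center, scale = FALSE) %*% pca$rotation
head(scores)

# the same values are stored by prcomp()
head(pca$x)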

The autoplot() function from the ggfortify package produces nearly the same graph:
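A minimal call could look like this (assuming the ggfortify package is installed; the same arguments are used in the larger example below):

library(ggfortify)
autoplot(prcomp(df), loadings = TRUE, loadings.label = TRUE)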

3D example by Ilya Schurov

Math behind the PCA

The main mathematical technique used in PCA is finding eigenvalues and eigenvectors. This is a simple piece of math, but you need a good background in linear algebra (here is a good course with nice visualisations).
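For the two-variable example above we can check this directly: the eigenvectors of the covariance matrix give the rotation, and the square roots of the eigenvalues give the standard deviations of the PCs (a sketch, assuming df is still loaded; the signs of the eigenvectors may be flipped):

e <- eigen(cov(df))

e$vectors       # compare with prcomp(df)$rotation (up to sign)
sqrt(e$values)  # compare with prcomp(df)$sdev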

R code example

We will use data from the novel by P. G. Wodehouse, “The Code of the Woosters”. I collected the frequencies of several character names across the chapters:

wodehouse <- read.csv("https://raw.githubusercontent.com/LingData2019/LingData/master/data/wodehouse_pca.csv")
wodehouse
chapter Harold Gussie Dahlia Jeeves Madeline Oates Spode Stiffy sir
Chapter 01 0.0000000 0.0024979 0.0012490 0.0041632 0.0010408 0.0000000 0.0000000 0.0002082 0.0041632
Chapter 02 0.0000000 0.0018984 0.0011865 0.0090176 0.0016611 0.0000000 0.0011865 0.0004746 0.0099668
Chapter 03 0.0000000 0.0045368 0.0008781 0.0014635 0.0035124 0.0000000 0.0058539 0.0001463 0.0021952
Chapter 04 0.0030370 0.0026229 0.0001380 0.0013805 0.0011044 0.0015185 0.0017946 0.0030370 0.0013805
Chapter 05 0.0002109 0.0029523 0.0014762 0.0082244 0.0000000 0.0002109 0.0040067 0.0014762 0.0113876
Chapter 06 0.0000000 0.0051960 0.0000000 0.0014171 0.0033066 0.0000000 0.0075579 0.0014171 0.0004724
Chapter 07 0.0000000 0.0040664 0.0016943 0.0052525 0.0005083 0.0000000 0.0108438 0.0015249 0.0047442
Chapter 08 0.0024439 0.0006110 0.0001527 0.0054987 0.0000000 0.0004582 0.0004582 0.0070261 0.0070261
Chapter 09 0.0022802 0.0007601 0.0002534 0.0007601 0.0007601 0.0022802 0.0000000 0.0035470 0.0030403
Chapter 10 0.0000000 0.0034056 0.0000000 0.0027864 0.0018576 0.0003096 0.0040248 0.0027864 0.0049536
Chapter 11 0.0000000 0.0057385 0.0007485 0.0064870 0.0014970 0.0022455 0.0027445 0.0002495 0.0129741
Chapter 12 0.0000000 0.0040197 0.0062528 0.0107191 0.0000000 0.0008933 0.0013399 0.0000000 0.0093792
Chapter 13 0.0021858 0.0007286 0.0014572 0.0076503 0.0000000 0.0018215 0.0010929 0.0014572 0.0094718
Chapter 14 0.0000000 0.0005652 0.0022607 0.0067822 0.0007536 0.0033911 0.0022607 0.0011304 0.0160136
library(GGally)
ggpairs(wodehouse[,-1])

PCA <- prcomp(wodehouse[,-1])
PCA
## Standard deviations (1, .., p=9):
## [1] 0.0056971923 0.0034593165 0.0021253926 0.0018379315 0.0011487786
## [6] 0.0009801512 0.0007123842 0.0005034320 0.0002181423
## 
## Rotation (n x k) = (9 x 9):
##                  PC1         PC2         PC3        PC4         PC5
## Harold    0.04446386 -0.26045490  0.07123814 -0.2007927  0.19032360
## Gussie    0.08779348  0.38160590 -0.15116503  0.2628203 -0.33907652
## Dahlia   -0.15615004  0.12382054 -0.44598913 -0.1200921  0.57733956
## Jeeves   -0.50150883  0.17557608 -0.44773479 -0.4588859 -0.25194436
## Madeline  0.11753413  0.11397729  0.02832850  0.3441355 -0.25226081
## Oates    -0.08429973 -0.11600713  0.21296940  0.2130548  0.52978937
## Spode     0.21798564  0.77103418  0.39697535 -0.3648273  0.21338990
## Stiffy    0.07986552 -0.33105390  0.36958846 -0.5731985 -0.24597480
## sir      -0.79975269  0.09921106  0.48203526  0.2003306 -0.04069356
##                  PC6         PC7         PC8         PC9
## Harold   -0.21673241  0.62957491 -0.34447525 -0.53532218
## Gussie   -0.78159891  0.13186661  0.10619326 -0.01443264
## Dahlia   -0.25790782 -0.50008339 -0.12601954 -0.28487803
## Jeeves    0.07393053  0.25634369 -0.25181765  0.32947429
## Madeline  0.13479280 -0.23614522 -0.84496615 -0.04925650
## Oates    -0.27686794  0.18859046 -0.22734885  0.66640912
## Spode     0.11194383  0.06916025 -0.06534478 -0.01343150
## Stiffy   -0.39980733 -0.40770488 -0.13773815  0.11544756
## sir      -0.05427500 -0.09007671  0.04324016 -0.25194682

How to interpret this:

\[PC1 = Harold \times 0.04446386 + Gussie \times 0.08779348 + Dahlia \times -0.15615004 + Jeeves \times -0.50150883 +\]

\[ + Madeline \times 0.11753413 + Oates \times -0.08429973 + Spode \times 0.21798564 + Stiffy \times 0.07986552 + sir \times -0.79975269\]
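We can verify this formula by hand for one chapter (a sketch; prcomp() centres the variables first, so we multiply the centred values by the loadings):

centred <- scale(wodehouse[,-1], scale = FALSE)

# PC1 score for Chapter 01 computed from the loadings
sum(centred[1, ] * PCA$rotation[, "PC1"])

# the same score as computed by prcomp()
PCA$x[1, "PC1"]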

How much of the variance is explained by each PC?

summary(PCA)
## Importance of components:
##                             PC1      PC2      PC3      PC4      PC5
## Standard deviation     0.005697 0.003459 0.002125 0.001838 0.001149
## Proportion of Variance 0.585790 0.215970 0.081530 0.060960 0.023820
## Cumulative Proportion  0.585790 0.801760 0.883290 0.944250 0.968070
##                              PC6       PC7       PC8       PC9
## Standard deviation     0.0009802 0.0007124 0.0005034 0.0002181
## Proportion of Variance 0.0173400 0.0091600 0.0045700 0.0008600
## Cumulative Proportion  0.9854100 0.9945700 0.9991400 1.0000000

That means that the first two components explain about 80% of the variance in the data.
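A common way to decide how many components to keep is to look at a scree plot of these variances; a minimal sketch:

# scree plot of the component variances
screeplot(PCA, type = "lines")

# cumulative proportion of explained variance
cumsum(PCA$sdev^2) / sum(PCA$sdev^2)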

wodehouse_2 <- wodehouse[,-1]
rownames(wodehouse_2) <- wodehouse[, 1] # chapter names as row names
PCA <- prcomp(wodehouse_2)

Visualisation with the ggfortify package:

library(ggfortify)
p1 <- autoplot(PCA,
               shape = FALSE,
               loadings = TRUE,
               label = TRUE,
               loadings.label = TRUE)
p1

The numbers on the graph are chapters, and the red lines are the old coordinate axes. This kind of graph is called a biplot. The angle between the old axes represents the correlation between the variables: the cosine of this angle approximately corresponds to Pearson's correlation coefficient.
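We can check this correspondence numerically for one pair of variables (a sketch; here the arrows are taken as the loadings weighted by the component standard deviations, and the match is only approximate because just two of the nine components are used):

# arrows for two variables in the plane of the first two PCs
a <- PCA$rotation["Jeeves", 1:2] * PCA$sdev[1:2]
b <- PCA$rotation["sir", 1:2] * PCA$sdev[1:2]

# cosine of the angle between the arrows
sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Pearson's correlation between the same variables
cor(wodehouse$Jeeves, wodehouse$sir)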

Let's transpose the data

wodehouse <- read.csv("https://raw.githubusercontent.com/LingData2019/LingData/master/data/wodehouse_pca.csv")
w2 <- t(wodehouse[,-1])
colnames(w2) <- wodehouse$chapter
PCA <- prcomp(w2)
PCA
## Standard deviations (1, .., p=9):
## [1] 1.002254e-02 4.928666e-03 2.965444e-03 2.412650e-03 1.821397e-03
## [6] 1.035342e-03 9.133533e-04 2.802319e-04 4.540828e-19
## 
## Rotation (n x k) = (14 x 9):
##                    PC1          PC2         PC3           PC4         PC5
## Chapter 01 -0.15263466 -0.001513558  0.16535932  0.0430770832  0.34268701
## Chapter 02 -0.37204371 -0.020028219  0.11657983  0.1312454455  0.23722864
## Chapter 03 -0.01813188  0.412179484  0.02161418 -0.1783410503  0.19161698
## Chapter 04  0.02441164  0.001669766 -0.22045109  0.1049453112  0.20181568
## Chapter 05 -0.39093914  0.116253799 -0.07536677  0.1317848140 -0.04688382
## Chapter 06  0.03752427  0.528501109 -0.10885786  0.0032358120  0.20930975
## Chapter 07 -0.14486142  0.598834670 -0.10087156  0.2998830378 -0.50483289
## Chapter 08 -0.19292324 -0.219991536 -0.53298331  0.5217074756  0.17624182
## Chapter 09 -0.02077421 -0.168481276 -0.31175833 -0.0329682455  0.04977489
## Chapter 10 -0.11809211  0.208994394 -0.26018514  0.0004857454  0.21505108
## Chapter 11 -0.39025974  0.102923750 -0.05776624 -0.4589155984  0.37693465
## Chapter 12 -0.36695065 -0.034963787  0.62648064  0.3208994523  0.03227427
## Chapter 13 -0.31376031 -0.158927691 -0.03649267  0.1348735652 -0.14102356
## Chapter 14 -0.47448653 -0.143999604 -0.19022859 -0.4751375701 -0.45297015
##                    PC6          PC7         PC8         PC9
## Chapter 01  0.02254613  0.049438050 -0.02672354 -0.63345991
## Chapter 02  0.65937984 -0.119509193  0.23291480  0.09273922
## Chapter 03  0.21796508  0.129372286 -0.65616036 -0.03178744
## Chapter 04 -0.27006099 -0.432382171 -0.26919800 -0.31204963
## Chapter 05 -0.05938171 -0.081486700 -0.42658102  0.52136899
## Chapter 06  0.11506359  0.037137727  0.36979229  0.13324975
## Chapter 07 -0.16031063 -0.131699715  0.15051934 -0.22423885
## Chapter 08 -0.02837714  0.316839378 -0.07858926 -0.07164758
## Chapter 09 -0.21208608  0.002688693  0.17909448  0.24436028
## Chapter 10 -0.01912647  0.325903487  0.15509613 -0.06738685
## Chapter 11 -0.41365980 -0.220505960  0.19045580  0.08852911
## Chapter 12 -0.37842825  0.272362516  0.02452666  0.04933785
## Chapter 13  0.17522179 -0.576996960  0.02100575 -0.08191341
## Chapter 14  0.10308361  0.303118096 -0.03204157 -0.25595046
summary(PCA)
## Importance of components:
##                            PC1      PC2      PC3      PC4      PC5
## Standard deviation     0.01002 0.004929 0.002965 0.002413 0.001821
## Proportion of Variance 0.69440 0.167920 0.060790 0.040240 0.022930
## Cumulative Proportion  0.69440 0.862320 0.923110 0.963350 0.986280
##                             PC6       PC7       PC8       PC9
## Standard deviation     0.001035 0.0009134 0.0002802 4.541e-19
## Proportion of Variance 0.007410 0.0057700 0.0005400 0.000e+00
## Cumulative Proportion  0.993690 0.9994600 1.0000000 1.000e+00
p2 <- autoplot(PCA,
               shape = FALSE,
               loadings = TRUE,
               label = TRUE,
               loadings.label = TRUE)
p2

library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

Scale

By default prcomp() does not scale the variables. With scale. = TRUE every variable is standardised (divided by its standard deviation) before the rotation, so variables with a larger variance do not dominate the first components.

wodehouse <- read.csv("https://raw.githubusercontent.com/LingData2019/LingData/master/data/wodehouse_pca.csv")
wodehouse_2 <- wodehouse[,-1]
rownames(wodehouse_2) <- wodehouse[, 1] # chapter names as row names
PCA <- prcomp(wodehouse_2, scale. = TRUE)
p3 <- autoplot(PCA,
               shape = FALSE,
               loadings = TRUE,
               label = TRUE,
               loadings.label = TRUE)
w2 <- t(wodehouse[,-1])
colnames(w2) <- wodehouse$chapter
PCA <- prcomp(w2, scale. = TRUE)
PCA
## Standard deviations (1, .., p=9):
## [1] 2.696462e+00 1.894978e+00 1.388132e+00 6.933491e-01 6.520839e-01
## [6] 4.583448e-01 2.869778e-01 1.134008e-01 1.766160e-16
## 
## Rotation (n x k) = (14 x 9):
##                    PC1          PC2         PC3          PC4         PC5
## Chapter 01 -0.33392360  0.004679004  0.14445545  0.353120578  0.43864868
## Chapter 02 -0.36116208 -0.023427004  0.07800706  0.127126764  0.05577386
## Chapter 03 -0.04793047  0.507989028 -0.07065034 -0.143365451  0.25822609
## Chapter 04  0.08283463 -0.063290857 -0.63129744  0.455155631  0.21643966
## Chapter 05 -0.36668174  0.044838120 -0.03752560  0.007801634 -0.13877691
## Chapter 06  0.03204635  0.501025735 -0.20865848  0.036617111  0.03839557
## Chapter 07 -0.16428764  0.407076034 -0.12679807  0.041279332 -0.61262049
## Chapter 08 -0.24155336 -0.249917324 -0.34794003  0.216249297 -0.30399299
## Chapter 09 -0.05331922 -0.396691972 -0.42480044 -0.352951285  0.07187230
## Chapter 10 -0.25792305  0.253531030 -0.34965565 -0.165726377  0.08159986
## Chapter 11 -0.34644165  0.045711362 -0.03128269 -0.239166957  0.39472973
## Chapter 12 -0.32165561 -0.014463851  0.28694271  0.358244924 -0.05740826
## Chapter 13 -0.34564612 -0.152082218  0.02294614  0.047723457 -0.16754085
## Chapter 14 -0.34022715 -0.095360637  0.02961940 -0.491627189 -0.05915694
##                    PC6         PC7         PC8         PC9
## Chapter 01 -0.17794738 -0.07677195  0.07433269  0.27462609
## Chapter 02 -0.11139922  0.55607332  0.23924431 -0.07705773
## Chapter 03 -0.06782500  0.18817420 -0.61780181  0.21435785
## Chapter 04  0.47738149  0.03218674 -0.15511604 -0.02114225
## Chapter 05  0.12502152 -0.01624812 -0.26894078 -0.24942702
## Chapter 06 -0.10432274  0.07368085  0.49968394  0.42669343
## Chapter 07  0.26172430 -0.18266734  0.11653731  0.08524616
## Chapter 08 -0.51436696  0.03982797 -0.23458641  0.18861643
## Chapter 09 -0.05090431 -0.20485388  0.18432763  0.30111887
## Chapter 10 -0.38579552 -0.17232747  0.11118222 -0.57341813
## Chapter 11  0.30627562 -0.22141510  0.18668528 -0.14977046
## Chapter 12  0.03269566 -0.55050939 -0.06095912  0.12914426
## Chapter 13  0.30018261  0.42832186  0.12363312 -0.02560732
## Chapter 14  0.16562682  0.02428583 -0.20169562  0.35479015
summary(PCA)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     2.6965 1.8950 1.3881 0.69335 0.65208 0.45834
## Proportion of Variance 0.5193 0.2565 0.1376 0.03434 0.03037 0.01501
## Cumulative Proportion  0.5193 0.7759 0.9135 0.94782 0.97819 0.99320
##                            PC7     PC8       PC9
## Standard deviation     0.28698 0.11340 1.766e-16
## Proportion of Variance 0.00588 0.00092 0.000e+00
## Cumulative Proportion  0.99908 1.00000 1.000e+00
p4 <- autoplot(PCA,
               shape = FALSE,
               loadings = TRUE,
               label = TRUE,
               loadings.label = TRUE)

library(gridExtra)
grid.arrange(p3, p4, ncol = 2)

Summary:

R functions

There are several R functions for PCA, MCA, and their visualisation; a couple of examples are shown below.
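For example (a sketch, assuming the FactoMineR package is installed; base R also provides princomp() as an alternative to prcomp()):

# FactoMineR's PCA() (note: PCA is also the name of our prcomp object above)
res <- FactoMineR::PCA(wodehouse_2, scale.unit = TRUE, graph = FALSE)
res$eig   # eigenvalues and proportions of explained variance

# base R alternative based on the correlation matrix
princomp(wodehouse_2, cor = TRUE)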

Lab