I recently had an interesting problem that I think I successfully solved using principal component analysis (PCA). PCA is one of a family of multivariate statistical methods known as ordinations. Frankly it, and related techniques, scare me a little. They’re just a little too easy to implement in your favorite statistical workspace without any idea how to interpret the output. Ideally one would spend some time under the hood, getting comfortable with the inner workings of the method before applying it. I’m not the best at symbolic reasoning, so in these situations I usually rely on some application examples to help me make the jump from the stats book to my analysis. Often, however, I’m trying to do something a little bizarre for which I can’t find a great guiding example. That was how my week started, but I think it ended up in a pretty good place.
My problem breaks down like this. I’ve got 2,700 observations (genomes) described by 3.2 million variables (never mind what the variables are at the moment – could be anything). I want to calculate the Bray-Curtis distance between these genomes. With 32 GB of RAM, distance calculations on a roughly 9-billion-cell matrix aren’t going to happen (I tried anyway, in R and in Python with pdist from scipy.spatial.distance; both failed). But what if I don’t need all 3.2 million variables? It’s reasonable to assume that many, probably the vast majority, of these variables aren’t contributing much to the variance between genomes. If I can get rid of these superfluous variables I’ll have a matrix small enough to perform the distance calculations.
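For scale, here’s a minimal sketch of the memory arithmetic, with the same pdist call run at toy scale (the toy array is made up; ‘braycurtis’ really is a built-in pdist metric, so the bottleneck is purely memory):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Dimensions from the problem above: a dense float64 matrix of
# 2,700 x 3.2 million cells needs ~69 GB, well past 32 GB.
n_genomes, n_vars = 2700, 3_200_000
print(f"{n_genomes * n_vars * 8 / 1e9:.0f} GB")  # 69 GB

# The same call works fine at toy scale; only the size kills it.
X_small = np.random.default_rng(0).random((100, 1000))
d = pdist(X_small, metric='braycurtis')
print(d.shape)  # condensed form: 100 * 99 / 2 = 4950 pairs
```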
Enter PCA. PCA essentially generates artificial variables (axes, or principal components) to describe the variance between samples. The original variables can be described in terms of their magnitude, or contribution, to each principal component. The principal components are ordered by how much variance they account for. In ecology you frequently see plots of the first two principal components (the two that account for the most variance), with arrows depicting vectors that describe the magnitude of the original variables along both principal components.
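The mechanics are easier to see in code. Here’s a minimal sketch on made-up data, using scikit-learn’s PCA as a stand-in for the libraries I mention at the end of the post:

```python
import numpy as np
from sklearn.decomposition import PCA  # stand-in for matplotlib.mlab.PCA

# Toy data: rows are samples (genomes), columns are variables.
rng = np.random.default_rng(0)
X = rng.random((200, 50))

pca = PCA().fit(X)

# components_[i, j] is the loading (contribution) of variable j on
# PC i; explained_variance_ratio_ orders the PCs by the share of
# variance each accounts for.
print(pca.components_.shape)              # (n_PCs, n_variables)
print(pca.explained_variance_ratio_[:3])  # largest shares first
```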
In PCA you end up with as many PCs as you have samples or variables, whichever is fewer. Some of those principal components aren’t describing much in the way of variance. You can evaluate this with a scree plot. Here’s a scree plot for the 200 PCs in my analysis:
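The plot itself isn’t reproduced here, but continuing the sketch above, a scree plot is just the explained-variance ratios plotted in order:

```python
import matplotlib.pyplot as plt

# Scree plot: variance explained per PC, in decreasing order.
# 'pca' is the fitted object from the previous sketch.
var = pca.explained_variance_ratio_
plt.plot(range(1, len(var) + 1), var, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Proportion of variance explained')
plt.show()
```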
There are two inflection points on this plot. Normally you would be able to ignore all but the first few principal components (i.e., those before the first inflection point), but in this case they don’t account for much of the variance – maybe 30%. I don’t think there’s any harm in retaining additional PCs, other than a larger matrix, so I chose to use all PCs to the left of the second inflection point, marked with the vertical line. Together those account for about 90% of the variance. Note, however, that this hasn’t done anything to reduce the number of variables in the original analysis.
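Continuing the sketch, here’s one way to express that selection programmatically. I picked my cutoff by eye from the scree plot, so the ~90% cumulative-variance threshold below is only a stand-in for the inflection point:

```python
import numpy as np

# Smallest k whose cumulative explained variance reaches ~90%.
cum_var = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cum_var, 0.90) + 1)

keep = pca.components_[:k, :]                # loadings of retained PCs
weights = pca.explained_variance_ratio_[:k]  # their variance shares
```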
For that I needed to evaluate a second scree plot: the sum of the loading magnitudes for each variable across the PCs I elected to keep. In my case the matrix columns are PCs and the rows are variables, so I’m simply summing across the rows. Before summing I normalize each magnitude by multiplying it by the proportion of variance accounted for by the relevant principal component (by magnitude I mean the absolute value of the variable’s loading). Here’s the second scree plot:
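The plot isn’t reproduced here either, but the calculation is a couple of lines, continuing the sketch (note the transposed layout relative to my description, since scikit-learn puts PCs in rows and variables in columns):

```python
import numpy as np

# Score each original variable: sum, over the retained PCs, of the
# absolute loading weighted by that PC's share of the variance.
# PCs are in rows here, so the sum runs down the columns rather
# than across the rows.
scores = (np.abs(keep) * weights[:, None]).sum(axis=0)

# Sorting the scores in decreasing order gives the second scree
# plot; variables past the inflection point can be dropped.
order = np.argsort(scores)[::-1]
```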
Again, there’s an obvious inflection point, marked by the vertical line. It suggests that roughly 100,000 variables are most influential on the PCs that account for most of the variability. In short, I can use these 100,000 variables and ditch the other 3.1 million. Are distance matrices calculated from the full and partial sets of variables comparable? As the plot below shows, they aren’t quite linearly correlated, but there is very good agreement between the two calculations (R = 0.97).
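Here’s a toy-scale sketch of that sanity check. The data and the variable subset are made up, but the mechanics are the same: two pdist calls and a Pearson correlation of the condensed distance vectors.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

# Small enough that the full calculation fits in memory.
rng = np.random.default_rng(1)
X = rng.random((200, 5000))
top = rng.permutation(5000)[:500]  # stand-in for the PCA ranking

d_full = pdist(X, metric='braycurtis')
d_part = pdist(X[:, top], metric='braycurtis')

# pdist returns condensed vectors, so the pairs line up directly.
r, _ = pearsonr(d_full, d_part)
print(round(r, 2))
```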
One final note on this. I ran the PCA and distance matrix calculations in both R and Python. In Python I used scipy.spatial.distance.pdist for the distance matrix and matplotlib.mlab.PCA for the PCA. In R I used prcomp and vegdist (from the vegan package). Anecdotally, memory utilization seemed to be quite a bit lower with Python, though I wasn’t watching it that closely.