One of the chapters of my dissertation was just rejected for publication a second time. I haven’t decided what to do with it yet, but in the meantime it makes an interesting case study in rejection. In brief, what we tried to do in this paper was make a large-scale statistical inference of amino acid composition in a taxonomically controlled group of psychrophiles (cold-adapted bacteria) and mesophiles (room-temperature-adapted bacteria). A protein derives its physical characteristics (structure, flexibility, substrate specificity, etc.) from the physicochemical properties of each amino acid (size, charge, solubility in water, etc.). As proteins in psychrophiles evolve, it is therefore reasonable to expect that they will favor certain amino acids associated with the structural modifications most likely to enhance protein function at low temperature. Because protein evolution, like all evolution, is a random process, we expect differences in amino acid composition between the two groups to be small and messy, but identifiable with a large enough sample size.
A fair bit of previous work has been done in this area, including Metpally and Reddy, 2009. There are, however, some issues with that analysis. Metpally and Reddy used a small number of predicted proteomes from psychrophiles and mesophiles (it was 2009, after all), and did not control for multiple comparisons across the large number of Student’s t-tests that they performed between the two groups. We hoped to improve on this with a larger, carefully controlled set of predicted proteomes and a more robust statistical approach.
The idea that differences in amino acid composition reflect adaptation to different thermal regimes is not popular in all circles. In our first submission we received two reviews, one positive and one negative. The negative review came from a protein chemist who does very high-resolution dynamical studies of proteins as they interact with their substrates. He noted that it takes only a single amino acid substitution to effect a major change in, say, the flexibility of a protein and thus its function at different temperatures. This is true, but I would counter that evolution doesn’t know to target a specific high-payoff residue for mutation. The process is random; every now and again a residue in a key dynamical position will be modified in a way that improves protein function, but much more often residues in less critical positions will mutate in ways that help just a little bit. We would expect to find a consistent pattern of amino acid substitutions across psychrophile proteins as a result of these mutations.
After some minor changes we resubmitted the manuscript to a different journal. I give the second editor much credit for going the distance and ultimately soliciting five different reviews. The reviews split 3 in favor and 2 against; usually one dissenter is enough to prevent publication. One of the objections raised, and the subject of this post, was our use of the t-test and p-value to describe the statistical significance of differences in amino acid composition between psychrophiles and mesophiles. I usually think of a large sample size as a good thing, as it allows the detection of differences that would otherwise be masked by noise or other effects. In this case, however, the large number of proteins in our dataset seems to have gotten us into some trouble.
We used predicted proteomes from 19 psychrophiles and 19 mesophiles, roughly 130,000 proteins in total. We made over 100 tests for differences in parameters (such as amino acid composition) between the two groups, applying the Holm-Bonferroni method to control for multiple comparisons. After correction, a number of parameters still exceeded our threshold for significance. A t-test of serine content, for example, yielded a p-value of essentially 0. The mean proportion of serine, however, was not that different between the groups: 0.0658 for the psychrophiles and 0.0636 for the mesophiles. We considered these slight preferences consistent with our ideas of protein evolution and reported them.
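As a sketch of that correction step, R’s built-in p.adjust handles the Holm method directly (the p-values and parameter names below are made up for illustration):

## hypothetical p-values from the >100 between-group tests
pvals <- c(ser = 1e-16, ala = 4e-4, gly = 0.03, arg = 0.2)
p_holm <- p.adjust(pvals, method = 'holm')
p_holm[p_holm < 0.05] ## parameters that survive correction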
To quote an anonymous reviewer: “Any inferential statistics test, being it parametric or non-parametric, reaches any desired statistical significance level at increasing N (sample size). This makes the issue of biological relevance drastically separated from the problem of statistical significance (any difference between two samples becomes significant at increasing N)”. The reviewer is concerned that the very small differences that we observed, while statistically significant, are not biologically relevant. They go on to recommend the treatment of this issue in Kraemer, 1992. That paper, which draws on a hypothetical medical study using a t-test between a treatment and a control group, makes some interesting points. I’ll attempt to reconstruct what they do in the context of differences in serine content between our two groups, and end with what I think it means for our study and what we might do about it.
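The reviewer’s point is easy to demonstrate with simulated data: hold a tiny true difference fixed (loosely based on our serine numbers) and the t-test p-value shrinks toward zero as N grows:

## a fixed, tiny difference becomes 'significant' at increasing N
set.seed(1)
for(n in c(100, 1000, 10000, 100000)){
  x <- rnorm(n, mean = 0.0658, sd = 0.022)
  y <- rnorm(n, mean = 0.0636, sd = 0.022)
  print(c(n, t.test(x, y)$p.value))
}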
First we can read in some data and take a look at the distributions.
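If you want to play with the code but don’t have the data file, rnorm will generate a suitable stand-in; a minimal sketch, assuming roughly the group means and standard deviation implied by the numbers later in this post (the sample sizes here are also assumptions):

## simulate stand-in data in place of serine.txt.gz
serine_treat <- rnorm(65000, mean = 0.0658, sd = 0.022)
serine_control <- rnorm(65000, mean = 0.0636, sd = 0.022)

With the real data in hand, the first steps look like this: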
## read the data; read.table handles the gzipped file directly
serine <- read.table('serine.txt.gz', header = F)

## split serine composition into treatment (cold) and control groups
serine_treat <- serine[which(serine[,1] == 'cold'), 2]
serine_control <- serine[which(serine[,1] == 'control'), 2]

## look at the distributions
treat_hist <- hist(serine_treat, breaks = 100)
control_hist <- hist(serine_control, breaks = 100)

plot(control_hist$counts ~ control_hist$mids,
     col = 'red',
     type = 'l',
     ylab = 'Count',
     xlab = 'Serine composition')

points(treat_hist$counts ~ treat_hist$mids,
       type = 'l')

## dashed vertical lines at the group means
lines(c(mean(serine_treat), mean(serine_treat)),
      c(-1000, 8000),
      lty = 2)

lines(c(mean(serine_control), mean(serine_control)),
      c(-1000, 8000),
      col = 'red',
      lty = 2)

legend('topright',
       legend = c('Treatment', 'Control', 'Treatment mean', 'Control mean'),
       col = c('black', 'red', 'black', 'red'),
       lwd = 1,
       lty = c(1, 1, 2, 2))
That produces the following plot. Notice how close the two means are. The difference is highly statistically significant by the t-test, but is it biologically meaningful? The Kraemer paper uses the term effect size to describe biological impact.
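For reference, the test behind that claim is a single line (R’s t.test, which defaults to Welch’s t-test):

## the t-test driving the near-zero p-value
t.test(serine_treat, serine_control)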
Central to Kraemer’s ideas on effect size is a plot of failure rate (in our analysis the fraction of a group falling below a given serine composition) as a function of the response values. The more different the failure rates for treatment and control, the greater the effect size (and the more biologically meaningful the result, or so the thinking goes).
## failure rate analysis: the fraction of each group falling
## below each (sorted) treatment value
lc <- length(serine_control)
lt <- length(serine_treat)
sr <- matrix(ncol = 3, nrow = lt)

i <- 0
for(st in sort(serine_treat)){
  i <- i + 1
  fr_sc <- length(which(serine_control < st)) / lc
  fr_st <- length(which(serine_treat < st)) / lt
  sr[i,] <- c(st, fr_sc, fr_st)
}
plot(sr[,3] ~ sr[,1],
     ylab = 'Failure rate',
     xlab = 'Response value',
     type = 'l')

points(sr[,2] ~ sr[,1],
       col = 'red',
       type = 'l')

legend('bottomright',
       legend = c('Treatment', 'Control'),
       col = c('black', 'red'),
       lwd = 1)
And that, in turn, produces the following plot of failure rate.
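Incidentally, the failure-rate curves are just empirical cumulative distribution functions, so R’s built-in ecdf would draw essentially the same picture:

## the same curves via R's built-in empirical CDF
plot(ecdf(serine_treat), main = '', xlab = 'Response value', ylab = 'Failure rate')
plot(ecdf(serine_control), col = 'red', add = TRUE)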
Note that “treatment value” is synonymous with serine composition; I’m trying to keep the nomenclature somewhat consistent with Kraemer, 1992. From this plot we can see that the failure rates of the two groups are pretty similar at any given treatment value, but not exactly the same. This can be explored further as a plot of the difference in failure rate as a function of treatment value.
plot((sr[,2] - sr[,3]) ~ sr[,1],
     type = 'l',
     xlab = 'Treatment value',
     ylab = 'Delta failure rate')
I interpret this plot as showing that the difference in failure rate is greatest for values near the mean treatment value. Biologically that makes sense to me. We don’t know what is happening at extremely high and low serine compositions, but those are presumably special cases. Typical proteins act as we expect, although the difference is small. What does that mean for the effect size?
One measure of the effect size is the raw mean difference, which we know to be ~0.002. The raw mean difference can also be read off the plot of failure rate as a function of treatment value, as the distance along the x axis between the two lines at 0.5 on the y axis. The problem with using this value as a measure of effect size is that it is blind to the variance within each sample set. To take this into account we can instead use the standardized mean difference, calculated as follows:
## calculate the standardized mean difference
ser_smd <- (mean(serine_treat) - mean(serine_control)) / sd(serine_control)
In this analysis the standardized mean difference is 0.100. The Kraemer paper gets a little hand-wavy at this point with the admission that the selection of an appropriate standardized mean difference threshold is arbitrary. Other work is cited that suggests values of 0.2, 0.5, and 0.8 as “small”, “medium”, and “large” differences. It’s trivial to work out what difference in mean serine composition would yield a standardized mean difference of 0.5:
## calculate the difference in means that would give
## a standardized mean difference of 0.5
delta_m <- 0.5 * sd(serine_control)
That works out to 0.011, roughly a five-fold increase over our observed mean difference of ~0.002. The conceptual problem that I’m having, however, is whether it is appropriate to apply these stringent criteria to the problem at hand. These methods were optimized for clinical studies, and a stated goal of Kraemer, 1992 is to help determine when it is good policy to recommend a treatment. It is desirable to know that a proposed treatment produces a large improvement in the majority of patients before it is adopted as general practice. That is very different from what we are trying to do, and I don’t think the concept of effect size can be treated the same in each case. I propose the following analogy to our study:
Suppose you own a house with a front lawn and a back lawn. You suspect that one grows faster than the other and would like to test this hypothesis. You mow the lawns on the same day, and a week later you carefully measure the length of 10,000 randomly selected blades from each lawn. You find that the lengths are normally distributed, with a mean of 10.00 cm in the front lawn and 10.35 cm in the back, and a standard deviation of 3.5 cm in each. These values are proportional to our serine values and would yield the same p-value. Would it be appropriate to say that the back lawn is growing faster? I believe the answer is yes; the p-value shows that the probability of observing a difference this large by chance alone is vanishingly small. If my goal was to grow grass faster, and I had spent considerable effort or money fertilizing my back lawn, I would probably not use these numbers to justify continued effort (i.e. the effect size is small). If, however, my goal was to understand the ecology of my yard, such as the impact of sun, water, dogs, kids, etc. on the amount of biomass, I would definitely take note of these numbers.
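That claim is easy to check by simulation, drawing blade lengths from the distributions described above:

## simulate the two lawns and test for a difference
set.seed(1)
front <- rnorm(10000, mean = 10.00, sd = 3.5)
back <- rnorm(10000, mean = 10.35, sd = 3.5)
t.test(front, back)$p.value ## vanishingly small
(mean(back) - mean(front)) / sd(front) ## SMD ~ 0.1, same as serine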
We would like to know the effect of evolved adaptation to cold environments on the amino acid composition of various proteins. We know the effect will be small, but we must keep in mind that this is an ecological question that requires an ecological* frame of mind.
So what to do next? Assuming that I can carve out the time to redo this analysis, I think I can make a more convincing case by counting the number of psychrophile-mesophile homologous protein pairs for which the psychrophile has the greater serine content. We could then apply a paired t-test and hopefully bolster the case that the observed signal, while small, is not there by chance. The issue of effect size will have to be dealt with in an improved discussion – far more terrifying than the thought of further analysis.
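A sketch of what that might look like, assuming a hypothetical table of homologous pairs with one serine composition per pair member (the file name and column names here are placeholders); the counting approach amounts to a sign test, which binom.test handles:

## hypothetical input: one row per homologous pair
pairs <- read.table('homolog_pairs.txt', header = T) ## columns: psychro, meso
## sign test: how often does the psychrophile member have more serine?
binom.test(sum(pairs$psychro > pairs$meso), nrow(pairs))
## and the paired t-test on the same data
t.test(pairs$psychro, pairs$meso, paired = T)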
Got a thought about this problem or an issue with my interpretation of Kraemer? Leave a comment!
*I might be taking too much liberty with “ecological” here. Consider, for example, a gene expression study – very much an ecological exercise. For determining interesting differences in the rate of gene expression, I think the Kraemer methodology would be very appropriate.