Making friends in the DOC pool

I’m going to keep draining the pool analogy until it’s dry…

I’m going to miss the biological oceanography journal club at LDEO tomorrow in order to attend the Sequencing the Urban Genome symposium at the NY Academy of Sciences.  It’s a step away from my primary field, but I’ve been to some human/built environment microbiome symposia before and found them to be fascinating.  All microbial ecology gets at the same fundamental set of questions, even as the environment being studied changes dramatically, so it is often useful to see how a different group of researchers are addressing the same problems that we are.  I was really looking forward to our paper discussion (to be led by Kyle Frischkorn), however, so to make up for it I decided to do a quick write up.

This week’s paper comes out of Ginger Armbrust’s lab at the University of Washington.  I was fortunate to know lead author Shady Amin while he was a postdoc with Ginger, he’s now an assistant professor at NYU’s Dubai campus.   The paper, Interaction and signalling between a cosmopolitan phytoplankton and associated bacteria, is another great example of what can happen when marine chemists team up with microbial geneticists, something that’s been happening a lot lately.  In this case they were able to uncover a two-way communication pathway between a marine diatom and a bacterial affiliate, based on a well-studied mechanism of communication between plants and bacteria in the rhizosphere.  We’ve known about plant-bacteria communication for a long time, but it was not obvious that a method of communication that works in the solid soil matrix (containing upwards of a billion bacteria per gram) would work in seawater (being liquid and containing only a hundred thousand to a million bacteria per ml).  So we’re several decades behind the plant people in understanding the directed influence marine bacteria can have on phytoplankton.  But moving fast.

nature14488-f2

Taken from Amin et al. 2015.

As shown in the figure above, the specific mechanism of communication reported by Amin et al. is based on the plant hormone indole-3-acetic acid (IAA).  The bacterium, a Sulfitobacter, produces IAA, which kicks the diatom into overdrive, upregulating genes involved in cell division and carbon fixation.  In turn the diatom releases the IAA precursor tryptophan, taurine, and DMSP, the latter two compounds serving as a carbon source for the bacterium.

What is really remarkable about this is that the relationship is highly specific.  The diatom was tested with many different bacteria isolated from diatom cultures, and the bacterium was grown in the presence of many different phytoplankton isolates.  Only for this specific pairing, however, did the diatom and bacteria both have a positive influence on the growth of the other.

To take the study to the next level Amin et al. measured IAA abundance in surface seawater in the North Pacific and looked for evidence of IAA biosynthesis pathways in several public metatranscriptomes.  Not surprisingly they found plenty of both; measurable levels of IAA in all samples and what seems to me to be a reasonably high number of transcripts associated with IAA pathways (not sure what 107 transcripts L-1 actually means though…).

This was a really cool study, but I’m having a hard time wrapping my head around the implications.  The specificity implied in the experiments contrasts sharply with the abundance of IAA and IAA biosynthesis pathways in the natural environment.  And if the communication pathway provides a major benefit to the bacterium and/or the diatom, why is it rare among laboratory isolates?  What is the cost to both the bacterium and diatom?  For example, diatom respiration is suppressed by IAA, is maintenance deferred in favor of growth?

Posted in Research | Leave a comment

Predicting metabolic pathways in a metagenome

For an ongoing project I needed to predict the metabolic pathways that are present in a metagenome.  This is actually something that I’ve been interested in trying to do for a while, as metabolic pathways can tell us much more about metabolic function than genes alone.  There may be some pre-packaged methods out there for doing this, but I’ve struggled over the years to find one that doesn’t require commercial software or a subscription to a commercial database (i.e. KEGG).

I’ve worked out the following method that relies on DIAMOND, Biopython, and Pathway-Tools.  All are available for free for academic users.  I went down this path at the request of a reviewer for a submitted manuscript, if you use the method please cite the (hopefully) forthcoming work – Microbial communities can be described by metabolic structure: A general framework and application to a seasonally variable, depth-stratified microbial community from the coastal West Antarctic Peninsula.  I haven’t archived the work on a pre-press server (I’m still not sure how I feel about those), but I’m happy to provide a copy of the manuscript and/or the citation.  I’ll try to remember to add the citation here when (if!) the revised manuscript is accepted (Accepted! see link in this article).

The general strategy here is to:

1.  Shake your fist at the hard, cruel world that forced KEGG to go commercial.

2.  Create a DIAMOND database of all the coding sequences present in all completed Genbank genomes.

3.  Use DIAMOND blastx to search all the metagenomic reads against this database.

4.  Create an artificial Genbank format sequence record containing all the features that had a hit.  To avoid time and memory constraints this record will contain only unique BRENDA EC identifiers and gene products.

5.  Use pathway tools to predict the metabolic pathways in this artificial “genome”.

A word of warning… I hacked this together in a non-linear fashion, thus the code presented here is NOT exactly what I ran.  There might be some errors, so if you try this and it doesn’t run just take a deep breath and track down the errors.  Feel free to ping me if you can’t find the problem.  Okay, here we go!

Download the complete genomes from Genbank and go get a fresh cup of coffee:

wget -r -A *.gbk -nc ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/

Next, extract all the feature translations from these Genbank files (yes, you could just download the faa files, but I had reasons for needing to do it this way).  Pay attention to directory locations (here and in the other scripts)!

### user setable variables ###

ref_dir = '/volumes/hd1/ref_genome_database/ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/' # location of the database directory

### end user setable variables ###

import os  
from Bio import SeqIO  

with open('all_genomes.fasta', 'w') as fasta_out, open('all_genomes_map.txt', 'w') as map_out:    
    for d in os.listdir(ref_dir):    
        for f in os.listdir(ref_dir + d):
            if f.endswith('.gbk'):                
                gbk_name = f
                
                for record in SeqIO.parse(ref_dir + d + '/' + gbk_name, 'genbank'):
                    parent = record.seq
                    
                    for feature in record.features:
                        if feature.type == 'CDS':
                        
                            start = int(feature.location.start)
                            end = int(feature.location.end)
                            strand = feature.location.strand
                            try:
                                trans = feature.qualifiers['translation'][0]    
                                feature_id = (start, end, strand)
                                
                                if strand == -1:
                                    sequence = parent[start:end].reverse_complement()
                                else:
                                    sequence = parent[start:end]
                                
                                seqname = record.id + '_' + str(start) + '_' + str(end) + '_' + str(strand)
                                print start, end, sequence[0:10]
                                print >> fasta_out, '>' + seqname + '\n' + trans
                                print >> map_out, d, f, seqname
    
                            except KeyError:
                                continue

Make a DIAMOND database:

diamond makedb --in all_genomes.fasta -d all_genomes_diamond

And run DIAMOND blastx (so fast!).  This took me about an hour with 24 cpus and 32 Gb RAM for a metagenome with 22 million reads…

diamond blastx -d all_genomes_diamond -q test_mg.good.fasta -o test_mg.good.diamond.txt -k 1

Now create the Genbank file containing all features that have a hit.

ref_dir = 'ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/'

import os
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

map_dict = {}

with open('all_genomes_map.txt', 'r') as map_in:
    for line in map_in:
        line = line.rstrip()
        line = line.split()
        map_dict[line[2]] = line[0]

hits = set()
genomes = set()

with open('test_mg.good.diamond.txt', 'r') as diamond:
    for line in diamond:
        line = line.rstrip()
        line = line.split()
        evalue = float(line[10])
        hit = line[1]
        if evalue < 1e-5:
            hits.add(hit)
            print 'hit =', hit
            genome = map_dict[hit]
            genomes.add(genome)

with open('all_genomes_nr/genetic-elements.dat', 'w') as all_genetic_elements:
    print >> all_genetic_elements, 'ID' + '\t' + 'all_genomes_nr'
    print >> all_genetic_elements, 'NAME' + '\t' + 'all_genomes_nr'
    print >> all_genetic_elements, 'TYPE' + '\t' + ':CHRSM'
    print >> all_genetic_elements, 'CIRCULAR?' + '\t' + 'N'
    print >> all_genetic_elements, 'ANNOT-FILE' + '\t' + 'all_genomes_nr.gbk'
    print >> all_genetic_elements, '//'  
    
with open('all_genomes_nr/organism-params.dat', 'w') as all_organism_params:
    print >> all_organism_params, 'ID' + '\t' + 'all_genomes'
    print >> all_organism_params, 'Storage' + '\t' + 'File'
    print >> all_organism_params, 'Name' + '\t' + 'all_genomes'
    print >> all_organism_params, 'Rank' + '\t' + 'Strain'
    print >> all_organism_params, 'Domain' + '\t' + 'TAX-2'
    print >> all_organism_params, 'Create?' + '\t' + 't'

good_features = []
ec = set()
products = set()
    
for d in os.listdir(ref_dir):
    if d in genomes:
        for f in os.listdir(ref_dir + d):        
            for record in SeqIO.parse(ref_dir + d + '/' + f, 'genbank'):            
                for feature in record.features:
                    keep = False
                    if feature.type == 'CDS':
                        try:
                            temp_ec = feature.qualifiers['EC_number']
                            for each in temp_ec:
                                if each not in ec:
                                    keep = True
                                    ec.add(each)
                                    print feature.type, each
                            if keep == True:
                                good_features.append(feature)
                        except KeyError:
                            try:
                                temp_product = feature.qualifiers['product'][0]
                                if temp_product not in products:
                                    good_features.append(feature)
                                    products.add(temp_product)
                                    print feature.type, temp_product
                            except KeyError:
                                continue
            

new_record = SeqRecord(Seq('nnnn', alphabet = IUPAC.ambiguous_dna), id = 'all_genomes_nr', name = 'all_genomes_nr', features = good_features) 
SeqIO.write(new_record, open('all_genomes_nr/all_genomes_nr.gbk', 'w'), 'genbank')

Once you’ve waded your way through that mess you can run the Pathos program in Pathway-Tools from the command line as such:

pathway-tools -lisp -patho all_genomes_nr/ -disable-metadata-saving &> pathos_all_genomes_nr.log

That will take a little while to run.  When it’s complete you can navigate to where Pathway-Tools store your user-generated pathways (probably something like home/you/ptools-local/pgdbs/user).  The pathways will be in the all_genomes_nycyc directory.  The best way to get at that information is to parse the friendly pathway-reports.txt file.

Update 6/5/15 – I was stuck in file parsing mode when I wrote that last paragraph.  Clearly the best way to “get at that information” is to fire up Pathway-Tools, which will let you view all the pathways by hierarchy, and their associated graphics.

Posted in Research | 10 Comments

Trick question: Is the oligotrophic ocean net autotrophic or heterotrophic?

Answer: Both!

For this week’s LDEO biological oceanography journal club  we discussed a very interesting paper recently published in Nature Communications by Pablo Serret et al., titled Both respiration and photosynthesis determine the scaling of plankon metabolism in the oligotrophic ocean.  The paper was suggested by graduate student Hyewon Kim who did an excellent job leading the discussion.

Given the scale of the global ocean, whether it serves as an overall source or sink of CO2 is a big question, with major implications for the flux of CO2 to the atmosphere.  The primary processes controlling this are photosynthesis (autotrophy), a sink for CO2, and respiration (heterotrophy), a source of CO2.  The difference between these (photosynthesis – respiration) is called net community production (NCP).  When NCP is positive, CO2 is fluxing into the surface ocean, when negative, CO2 is fluxing out.  The dynamics that determine exactly how much photosynthesis and respiration is going on are extremely complex and our understanding of both is definitely incomplete.  Clearly light, temperature, and nutrients have something to do with the efficiency of these processes, but so does community composition (controlled in turn by light, nutrients, water column structure), grazing pressure, and a host of other variables.  Predicting the sign of NCP mechanistically, although highly desirable, is beyond our current abilities.  In place of this we rely on estimates of NCP from direct measurements of photosynthesis and respiration.

It is reasonable to think that direct measurements should be better than a mechanistic model anyway, however, without a model there is no way to confirm that the measurements are actually right.  This is particularly important for NCP estimates, as the measurements of both photosynthesis and respiration are all over the board and are known to be fundamentally flawed, as shown in this figure from Williams et al., 2013.

ma50535.f2As categorized in the above figure, measurements of NCP come in two flavors, in vitro and in situIn vitro measurements are typically bottle incubations, conducted on the deck of a ship in a water bath (to maintain in situ temperature) under both light and dark conditions.  The amount of oxygen consumed in the dark provides an estimate of nighttime respiration.  The amount of oxygen produced in the light (slightly shaded to represent in situ light conditions) provides an estimate of daytime NCP.  Subtracting nighttime respiration from daytime NCP gives an estimate of NCP over an entire 24 hour period.  There is lot, however, that can go wrong with this.

One of the biggest issues is “bottle effect”, the influence the incubation conditions have on the microbial community.  Biological oceanographers have known about “bottle effects” for a long time.  We try to minimize these effects, for example by acid washing glassware to eliminate trace metals, but there is no way that a bottle incubation will ever accurately reflect what happens in the ocean.  Even slightly inhibiting or aiding photosynthesis or respiration in these incubations can tip the NCP balance in one direction of the other.

In situ observations are probably more accurate  – at least they look a lot cleaner in the Williams et al. figure – but they have their own set of problems.  In situ values are calculated from the concentration of O2 and argon (Ar) in the water.  Ar is an inert gas and helps constrain the amount of O2 delivered to the water by physical processes.  It is non-trivial (and expensive) to make accurate measurements of these gases in seawater, and the estimate of NCP can be thrown off by unaccounted for physical processes in the water column.

From Serret et al. 2015.  Blue line/closed squares are for the North Atlantic subtropical gyre.  Red line/open squares are for the South Atlantic subtropical gyre.

From Serret et al. 2015. Blue line/closed squares are for the North Atlantic subtropical gyre. Red line/open squares are for the South Atlantic subtropical gyre. Values above the 1:1 line indicate a negative NCP, values below indicate positive NCP.

Back to the paper.  In this paper Serret et al. undertook perhaps the most comprehensive in vitro analysis of NCP to date on a series of cruises following a north/south transect in the Atlantic Ocean.  In addition to the broad geographical area, what’s special about the dataset they collected is not that it’s necessarily right – they are still relying on bottle incubations, after all – but, because the data was collected by the same personnel using the same methods, it might at least be consistently wrong.  That’s a bigger improvement than it sounds.  Previously NCP estimates have assumed that the subtropical gyres (the vast, low productivity regions in the center of each ocean basin) are more or less the same.  Serret et al. conclusively demonstrated that the North Atlantic and South Atlantic gyres have different NCP values; negative in the north (more respiration than photosynthesis) and positive in the south (more photosynthesis than respiration).

In this day and age the idea that the ocean is the same all over is pretty quaint.  We understand that there is tremendous spatial and temporal variability, even when we average across pretty large areas and timescales.  Direct observations of this variability are lacking however, given the size of the ocean and the limited number of oceanographic cruises tasked to studying it.  There are some issues with this study in terms of possible seasonal and geographic bias (one member of our discussion group noted that the northern cruise track veers strongly east), and sample size, but it’s a tremendous improvement over what’s out there.

Posted in Research | Leave a comment

Adult swim – Old carbon in the DOC pool

We’re starting a summer biological oceanography journal club at LDEO and the inaugural paper is by Lechtenfield et al. (2015), published just a couple of months ago in Nature Communications.  Titled Marine sequestration of carbon in bacterial metabolites, the paper is one of several very interesting articles to come out recently on the nature of marine dissolved organic carbon (DOC).  Marine chemists are on the cusp of something akin to where microbial ecologists were in 2005, when major advances in sequence technology suddenly opened up a world of microbial diversity that hadn’t really been anticipated.  Ten years later we’re still grappling with the paradigm shift, in particular with the question of how necessary understanding all this diversity is to understanding the function of the microbial ecosystem.  Interestingly, because microbial enzymes are involved in the modification of DOC components, the question of DOC chemical diversity and microbial diversity are fundamentally intertwined.

Why all the fuss about DOC?  The ocean’s a big place and contains a shocking amount of carbon – around 38,000 gigatons (or pentagrams, if you prefer).  To put that into perspective consider that the atmosphere contains only about 750 Gt.  Given the size of this reservoir, and the sensitivity of the climate system to atmospheric carbon, the amount of carbon going into and coming out of the ocean keeps a number of people up at night.  Most of this carbon is inorganic; CO2 and it’s various sister species.  Although this inorganic pool is influenced by biology through photosynthesis and respiration, it is primarily controlled by rather dull and predictable geological cycles*.  A smaller but still very significant amount, roughly equivalent to the amount of carbon in the atmosphere, is DOC.  DOC is interesting because it can be cycled very quickly or very slowly, depending on its chemical structure.  Structures that are easily broken down by microbial enzymes are called labile.  Those that are not are called refractory.

Until recently marine chemists exploring the nature of the DOC pool primarily undertook targeted studies of specific DOC components.  A researcher, for example, might choose to look at a certain class of lipids.  Chemical properties of the target compound class, such as solubility in different solvents, would be exploited to isolate the compounds from seawater, and the compounds would then be analyzed by mass spectrometry and other methods.  Such methods require very precise a priori knowledge of the compounds of interest – a problem for the highly complex and poorly characterized marine DOC pool.

Lechtenfeld et al. used solid phase extraction (SPE) in combination with mass spectrometry, and, more importantly, nuclear magnetic resonance spectroscopy (NMR) to undertake a broad-spectrum characterization of the DOC pool.  By combining these methods they were able to consider both compositional diversity (the various combinations of CHNOP and S that make up organic matter) and structural diversity (the actual arrangements of these elements into meaningful functional groups), without a priori knowledge of what they were looking for.

Using these methods they analyzed the DOC pool produced by marine microbes grown on simple and well defined carbon sources along with the DOC pool of natural seawater.  What they found was, first of all, that the DOC pool is enormously complex, as illustrated in the picture below.  The figure draws heavily on this incredibly detailed work.  I spent a little time with it and I’m still not sure exactly how to interpret the figure – but my best guess is that each point represents a unique molecular configuration, and that (as indicated on the plot) certain regions are diagnostic of specific functional groups.   The plot is in fact a 2D representation of the NMR data, so it helps to think of it in terms of the more familiar independent 1D representations before trying to put it all together.

Pages from Marine sequestration of carbon in bacterial metabolites-3

2D NMR spectrogram from Lechtenfeld et al. 2015.

In this figure the left two frames are DOC from the culture experiments while the right is from near-surface seawater.  While the seawater is clearly more complex, the cultures produced an astonishingly diversity of chemical compounds in a very short amount of time.  Interestingly, roughly %30 of the molecules they produced belong to a class of molecules called carboxyl-rich alicyclic molecules (CRAM).  One example CRAM are the hopanoids, pictured below sans carboxyl-rich side chain.

From Wikipedia at http://en.wikipedia.org/wiki/Hopanoids#/media/File:Hopanoid_01.png.

From Wikipedia at http://en.wikipedia.org/wiki/Hopanoids#/media/File:Hopanoid_01.png.

Due to their cyclic nature CRAM and related molecules can be very resistant to microbial degradation.  I went to a very cool talk recently where it was proposed that these compounds are most readily degraded when ocean water cycles through the aquifer on hydrothermal ridge flanks.  That happens more than one might think, but your average CRAM still has a long wait before its particular packet of water happens to cycle through the aquifer.

So the point is that marine bacteria are efficient little factories for converting simple, labile organic compounds into complex, refractory ones.  This increases the average age of the DOC pool and the amount of photosynthetically fixed carbon that can be sequestered in the deep ocean and sediment.  Why and how, exactly, they do this is an open question, as is exactly how refractory is refractory?

 

 

*I know you geochemists have a sense of humor.

Posted in Research | Leave a comment

Protein flexibility calculation with Python

In a couple chapters of my dissertation I made use of protein flexibility values calculated with the ProtParam module in Biopython (Bio.SeqUtils.ProtParam).  The flexibility function makes use of amino acid β-factors as reported by Vihinen et al. (1994).  We were called out in a recent review for (among other things) not using more up-to-date β-factors, particularly those reported in Smith et al. (2003).  I’m sure there’s some way to wrench open the hood and modify the ProtParam function so that it uses the newer β-factors, but that doesn’t sound like much fun.  Instead I generated a new function that takes any protein sequence (as a string) and returns a list of flexibilities, smoothed with a 9 residue window.  I knocked this up quick, so test before you use 🙂

def flexcalc(prot):
    b_factor = {'A': '0.717', 'C': '0.668', 'E': '0.963', 'D': '0.921', 'G': '0.843', 'F': '0.599', 'I': '0.632', 'H': '0.754', 'K': '0.912', 'M': '0.685', 'L': '0.681', 'N': '0.851', 'Q': '0.849', 'P': '0.85', 'S': '0.84', 'R': '0.814', 'T': '0.758', 'W': '0.626', 'V': '0.619', 'Y': '0.615'}
    b_factor_rr = {'A': '0.556', 'C': '0.607', 'E': '0.805', 'D': '0.726', 'G': '0.651', 'F': '0.465', 'I': '0.51', 'H': '0.597', 'K': '0.863', 'M': '0.575', 'L': '0.504', 'N': '0.735', 'Q': '0.729', 'P': '0.752', 'S': '0.698', 'R': '0.676', 'T': '0.648', 'W': '0.577', 'V': '0.503', 'Y': '0.461'}
    b_factor_rf = {'A': '0.704', 'C': '0.671', 'E': '0.911', 'D': '0.889', 'G': '0.811', 'F': '0.582', 'I': '0.617', 'H': '0.734', 'K': '0.862', 'M': '0.641', 'L': '0.65', 'N': '0.848', 'Q': '0.817', 'P': '0.866', 'S': '0.846', 'R': '0.807', 'T': '0.742', 'W': '0.609', 'V': '0.603', 'Y': '0.567'}
    b_factor_ff = {'A': '0.847', 'C': '0.67', 'E': '1.11', 'D': '1.055', 'G': '0.967', 'F': '0.653', 'I': '0.686', 'H': '0.894', 'K': '1.016', 'M': '0.74', 'L': '0.788', 'N': '0.901', 'Q': '1.007', 'P': '0.857', 'S': '0.914', 'R': '0.942', 'T': '0.862', 'W': '0.656', 'V': '0.707', 'Y': '0.741'}
    
    rigid = set(['A', 'C', 'F', 'I', 'H', 'M', 'L', 'W', 'V', 'Y'])
    flex = set(['E', 'D', 'G', 'K', 'N', 'Q', 'P', 'S', 'R', 'T'])
    
    ## get beta factors for each residue, according to neighbors
    
    b_factors = []
                    
    for r in range(0, len(prot)):
        b_query = False ## just making sure something doesn't go awry and an old value is used
        query = prot[r]
        if r == 0:
            b_query = b_factor[query]
        elif r == len(prot) - 1:
            b_query = b_factor[query]
        else:
            if prot[r - 1] in rigid:
                if prot[r + 1] in rigid:
                    b_query = b_factor_rr[query]
                elif prot[r + 1] in flex:
                    b_query = b_factor_rf[query]
            elif prot[r - 1] in flex:
                if prot[r + 1] in rigid:
                    b_query = b_factor_rf[query]
                elif prot[r + 1] in flex:
                    b_query = b_factor_ff[query]
        if b_query == False:
            print(r, 'bad b-factor!')
            exit()
        else:
            b_factors.append(b_query)
            
    ## average over 9 aa window
            
    b_factors_win = []
            
    for b in range(0, len(b_factors)):
        win = b_factors[b:b + 9]
        if len(win) == 9:
            b_win = sum(map(float, win)) / 9.0
            b_factors_win.append(b_win)
            
    return(b_factors_win)

Naturally we’d like to see how this compares with ProtParm.  Here’s a quick comparison using an aminopeptidase from the psychrophile Colwellia psychrerythraea 34H:

prot = 'MKHFSKLCFLLSTFAVSIAPVTWAHEGATHQHANVSKLTDAYTYANYDQVKATHVYLDLNVDFDKKSLSG'\
'FAELSLDWFTDNKAPLILDTRDLVIHRVMAKNSQGQWVKVNYDLAKRDDVLGSKLTINTPLNAKKVRVYY'\
'NSTEKATGLQWLSAEQTAGKEKPFLFSQNQAIHARSWIPIQDTPSVRVTYTARITTDKDLLAVMSANNEP'\
'GTERDGDYFFSMPQAIPPYLIAIGVGDLEFKAMSHQTGIYAESYILDAAVAEFDDTQAMIDKAEQMYGKY'\
'RWGRYDLLMLPPSFPFGGMENPRLSFITPTVVAGDKSLVNLIAHELAHSWSGNLVTNESWRDLWLNEGFT'\
'SYVENRIMEAVFGTDRAVMEQALGAQDLNAEILELDASDTQLYIDLKGRDPDDAFSGVPYVKGQLFLMYL'\
'EEKFGRERFDAFVLEYFDSHAFQSLGTDNFVKYLKANLTDKYPNIVSDNEINEWIFKAGLPSYAPQPTSN'\
'AFKVIDKQINQLVTDELTLEQLPTAQWTLHEWLHFINNLPVDLDHQRMVNLDKAFDLTNSSNAEIAHAWY'\
'LLSVRADYKEVYPAMAKYLKSIGRRKLIVPLYKELAKNAESKAWAVEVYKQARPGYHGLAQGTVDGVLK'

test1 = flexcalc(prot)

from Bio.SeqUtils.ProtParam import ProteinAnalysis as PA
test_seq = PA(prot)
test2 = PA.flexibility(test_seq)

import matplotlib.pyplot as plt

plt.scatter(test1[1:], test2)

Which produces the following.  ProtParam values on the y-axis and our values on the x-axis.  Note that the scales are different, they are arbitrary.  I haven’t figured out how to fit a linear model in Python yet (on the to do list), but the fit looks pretty good.  Plenty of scatter but the methods are in general agreement (which is good – we already have one paper published using the old method!).  Time to go back and redo a lot of work from the last two years…

figure_1

Posted in Research | 4 Comments

A tale of two studies: Big Data and a cautionary analogy

One of the chapters of my dissertation was just rejected for publication a second time.  I haven’t decided what to do with it yet, but in the meantime it makes an interesting case study in rejection.  In brief, what we tried to do in this paper was make a large scale statistical inference of amino acid composition in a taxonomically controlled group of psychrophiles (cold adapted bacteria) and mesophiles (room temperature adapted bacteria).  A protein derives its physical characteristics (structure, flexibility, substrate specificity, etc.) from the physiochemical properties of each amino acid (size, charge, solubility in water, etc.).  As proteins in psychrophiles evolve it is therefor reasonable to expect that they will preference certain amino acids associated with the structural modifications most likely to enhance protein function at low temperature.  Because protein evolution, like all evolution, is a randomized process, we expect differences in amino acid composition between the two groups to be small and messy, but identifiable with a large enough sample size.

A fair bit of previous work has been done in this area, including Metpally and Reddy, 2009.  There are, however, some issues with that analysis.  Metpally and Reddy used a small number of predicted proteomes from psychrophiles and mesophiles (it was 2009, after all), and did not control for multiple comparisons in the large number of Students t-tests that they performed between these two groups.  We hoped to improve on this with a larger, carefully controlled set of predicted proteomes and a more robust statistical approach.

The idea that differences in amino acid composition reflect adaptation to different thermal regimes is not popular in all circles.  In our first submission we received two reviews, one positive and one negative.  The negative review came from a protein chemist who does very high resolution dynamical studies of proteins as they interact with their substrates.  He noted that it takes only a single amino acid to effect a major change in, say, the flexibility of a protein and thus its function under different temperatures.  This is true, but I would counter that evolution doesn’t know to target a specific high-payoff residue for mutation.  This process is random; every now and again a residue in a key dynamical position will get modified in a way that improves protein function.  Much more often residues in less key positions will mutate in a way that helps just a little bit.  We would expect to find a consistent pattern in amino acid substitutions across psychrophile proteins as a result of these mutations.

After some minor changes we resubmitted the manuscript to a different journal.  I give the second editor much credit for going the distance and ultimately soliciting five different reviews.  The reviews were roughly 3 in favor and 2 against; usually one dissenter is enough to prevent publication.  One of the objections raised, and the subject of this post, was our use of the t-test and p-value to describe the statistical significance of differences in amino acid composition between psychrophiles and mesophiles.  I usually think of large sample size as a good thing as it allows the detection of correlations that would otherwise be masked by noise or other effects.  In this case, however, the large number of proteins in our dataset seems to have gotten us into some trouble.

We used predicted proteomes from 19 psychrophiles and 19 mesophiles, roughly 130,000 proteins total.  We made over 100 tests for differences in parameters (such as amino acid composition) between the two groups, applying the Holm-Bonferroni method to control for the multiple comparisons.  After control we had some number of parameters that exceeded our (Holm-Bonferroni corrected) threshold for significance.  A t-test of serine content, for example, yielded a p-value of essentially 0.  The mean proportion of serine, however, was not that different between the groups, at 0.0658 for the psychrophiles and 0.0636 for the mesophiles.  We considered these slight preferences as consistent with our ideas of protein evolution and reported them.

To quote an anonymous reviewer: “Any inferential statistics test, being it parametric or non-parametric, reaches any desired statistical significance level at increasing N (sample size). This makes the issue of biological relevance drastically separated from the problem of statistical significance (any difference between two samples becomes significant at increasing N)“.  The reviewer is concerned that the very small differences that we observed, while statistically significant, are not biologically relevant.  They go on to recommend the treatment of this issue in Kraemer, 1992.  That paper, which draws on a hypothetical medical study using a t-test between a treatment and control group, makes some interesting points.  I’ll attempt to reconstruct what they do in the context of differences in serine content between the two groups, and end with what I think it means for our study and what we might do about it.

First we can read in some data and take a look at the distributions (if you want to play with the code just use rnorm to generate some normally distributed data).

serine <- read.table('serine.txt.gz', header = F)
serine_treat <- serine[which(serine[,1] == 'cold'),2]
serine_control <- serine[which(serine[,1] == 'control'),2]

## look at the distributions

treat_hist <- hist(serine_treat, breaks = 100)
control_hist <- hist(serine_control, breaks = 100)

plot(control_hist$counts ~ control_hist$mids,
     col = 'red',
     type = 'l',
     ylab = 'Count',
     xlab = 'Serine composition')

points(treat_hist$counts ~ treat_hist$mids,
       type = 'l')

lines(c(mean(serine_treat), mean(serine_treat)),
       c(-1000,8000),
      lty = 2)

lines(c(mean(serine_control), mean(serine_control)),
      c(-1000,8000),
      col = 'red',
      lty = 2)

legend('topright',
       legend = c('Treatment', 'Control', 'Treatment mean', 'Control mean'),
       col = c('black', 'red', 'black', 'red'),
       lwd = 1,
       lty = c(1,1,2,2))

That produces the following plot.  Notice how close the two means are.  They are very statistically significantly different by the t-test, but is this biologically meaningful?  The Kraemer paper uses the term effect size to describe biological impact.

Rplot

Central to the Kraemer ideas of effect size is a plot of failure rate (in our analysis the fraction of the control group that is above a given value of the control group) as a function of the control values.  The more different the failure rates for the treatment and control, the greater the effect size (the more biologically meaningful the result, or so the thinking goes).

## failure rate analysis
lc <- length(serine_control)
lt <- length(serine_treat)

sr <- matrix(ncol = 3, nrow = lt)

i <- 0
for(st in sort(serine_treat)){
  i <- i + 1
  print(i)
  fr_sc <- length(which(serine_control < st)) / lc
  fr_st <- length(which(serine_treat < st)) / lt
  
  sr[i,] <- c(st, fr_sc, fr_st)
}

plot(sr[,3] ~ sr[,1],
     ylab = 'Failure rate',
     xlab = 'Response value',
     type = 'l')

points(sr[,2] ~ sr[,1],
       col = 'red',
       type = 'l')

legend('bottomright',
       legend = c('Treatment', 'Control'),
       col = c('black', 'red'),
       lwd = 1)

And that, in turn, produces the following plot of failure rate.

Rplot02

Note that treatment value is synonymous with serine composition.  I’m trying to keep the nomenclature somewhat consistent with Kraemer, 1992.  From this plot we can see that the failure rate is pretty similar for a given treatment value for the two groups, but not exactly the same.  This can be explored further as a plot of failure rate difference as a function of treatment value.

plot((sr[,2] - sr[,3]) ~ sr[,1],
     type = 'l',
     xlab = 'Treatment value',
     ylab = 'Delta failure rate')

Rplot03

I interpret this plot as showing that, for values near the mean treatment value, the difference in serine composition is greatest.  Biologically that makes sense to me.  We don’t know what is happening at extremely high and low serine concentrations, but those are presumably special cases.  Typical proteins act as we expect, although the difference is small.  What does that mean for the effect size?

One measure of the effect size is the raw mean difference, which we know to be ~0.002.  The raw mean difference can also be observed on the plot of failure rate as a function of treatment value, as the difference along the x axis between the two lines at 0.5 on the y axis.  The problem with using this value a measure of effect size is that it is blind to variance within each sample set.  To take this into account we can instead use the standardized mean difference, calculated as such:

## calculate standard mean distance
ser_smd <- (mean(serine_treat) - mean(serine_control)) / sd(serine_control)

In this analysis the standard mean distance is 0.100.  The Kraimer paper gets a little arm wavy at this point with the admission that the selection of an appropriate standard mean distance is arbitrary.  Other work is cited that suggests values of 0.2, 0.5, 0.8 as “small”, “medium”, and “large” differences.  It’s trivial to work out what difference in mean serine composition would yield a standard mean distance of 0.5:

## calculate the difference in means that would give
## a standard mean distance of 0.5
delta_m <- 0.5 * sd(serine_control)

That works out to 0.011, roughly a five-fold increase over our observed mean distance of 0.002.  The conceptual problem that I’m having, however, is how appropriate it is to apply these stringent criteria to the problem at hand.  These methods were optimized for clinical studies, and a stated goal of Kraemer, 1992 is to help determine when it is good policy to recommend a treatment.  It is desirable to know that the proposed treatment produces a large improvement in the majority of patients before it is adopted as general practice.  That is very different from what we are trying to do and I don’t think the concept of effect size can be treated the same in each case.  I propose the following analogy to our study:

Suppose you own a house with a front lawn and a back lawn.  You suspect that one grows faster than the other and would like to test this hypothesis.  You mow the lawns on the same day, and a week later you carefully measure the length of 10,000 randomly selected blades from each lawn.  You find that the lengths are normally distributed, with a mean of 10.00 cm in the front lawn and 10.35 cm in the back, with a standard deviation of 3.5 cm.  These values are proportional to our serine values and would yield the same p-value.  Would it be appropriate to say that the back lawn is growing faster?  I believe that the answer is yes; the p-value shows that the probability that this happened by chance is vanishingly small.  If my goal was to grow grass faster, and I had spent some considerable effort or money fertilizing my back lawn, I would probably not use these numbers to justify continued effort (i.e. the effect size is small).  If, however, my goal was to understand the ecology of my yard, such as the impact of sun, water, dogs, kids, etc. on the amount of biomass in my yard, I would definitely take note of these numbers.

We would like to know the effect of evolved adaptation to cold environments on the amino acid composition of various proteins.  We know the effect will be small, but we must keep in mind that this is an ecological question that requires an ecological* frame of mind.

So what to do next?  Assuming that I can carve out time to redo this analysis I think I can make a more convincing case by counting the number of psychrophile-mesophile homologous pairs of proteins for which the psychrophile has a greater serine content.  We could then apply a paired t-test and hopefully bolsters the case that the observed signal, while small, is not there by chance.  The issue of effect size will have to be dealt with in an improved discussion – far more terrifying than the thought of further analysis.

Got a thought about this problem or an issue with my interpretation of Kraemer?  Leave a comment!

*I might be taking too much liberty with ecological here.  Consider a gene expression study (for example), very much an ecological exercise.  In the case of determining interesting differences in the rate of gene expression I think the Kraemer methodology would be very appropriate.

Posted in Research | 1 Comment

Antarctic frost flower paper submitted

Moments ago I finally submitted the final frost flower paper from my PhD, a paper which first found life as Appendix 2 in my dissertation.  Once upon a time I thought this particular project was going to be the central pillar of my doctoral work.  It didn’t quite work out that way, but I’m really glad to see this manuscript go off into the world.  I wish it well.

Most of the frost flower and young sea ice work that I conducted with my adviser, Jody Deming at the University of Washington, took place in the Arctic.  Starting all the way back in 2009 (and thanks to collaborators in the OASIS project – a collaboration that, in retrospect, we should have worked harder to maintain) we sampled frost flowers and young sea ice during the winter and spring near Barrow, Alaska.  Near Barrow an offshore polynya, part of the circumpolar flaw-lead system, provides access to a region of ice formation all through the winter.  The fickle nature of the polynya and issues with access and timing however, lead us to consider other possibilities.  At about the same time I got interested in a series of papers suggesting that frost flowers could be the source of sulfate depleted sea salts in coastal Antarctic glaciers.  Repeatedly skunked on Barrow sampling trips and with the pressure mounting to develop a thesis topic I hatched a plan to develop a biological side to this sea salt story.  If frost flowers were responsible for transporting salt to the Antarctic interior, and our work had demonstrated that frost flowers were enriched in bacteria, it stood to reason that bacteria would be transported as well.  This could link the marine and glacial microbial environments in a previously unexplored fashion, with implications for gene flow to and microbial habitation of the expansive glacial environment.

We proposed all this to NSF and received partial support to conduct a pilot study (see NSF award 1043265).  Our efforts the following (2011) field season are well documented in the earliest entries on this blog, starting here.  After major bureaucratic delays (the McMurdo Station personnel were very reluctant to allow sampling in the young sea ice zone), blizzards, injuries, mishaps involving Weddell seals, and various other trials and humiliations that I’ve probably purged from my memory we emerged from Antarctica with samples of ice from the surface of Taylor and Wilson-Piedmont Glaciers, snow, frost flowers, newly formed sea ice, and seawater.  Our plan was to sequence the 16S rRNA gene from DNA extracted from these samples (a standard way of describing community composition and structure in microbial ecology) and look for overlaps.  Abundant frost flower microbes that also appeared in glacial ice, for example, would indicate that our proposed transport mechanism was active.

Collecting frost flower and young sea ice from Lewis Bay on the north side of Ross Island in October of 2011.

The author collecting frost flowers and young sea ice from Lewis Bay on the north side of Ross Island, October of 2011.

Our first sequencing effort was a complete disaster.  Weeks of work went down the drain with our primary samples.  It still isn’t clear where things went wrong; in our lab, at the (nameless if not blameless) sequencing center, or both.  My money’s on both.  Thankfully we had a limited set of backup samples and just enough funding left to try one more time.  This time we switched from the 454 to the Illumina sequencing platform, and placed our samples in the capable hands of the sequencing center at the Argonne National Lab.  A few weeks later we had data.  By this time, however, I’d moved on to other things and was cramming to finish the existing chapters of my dissertation.  Such is the way things go.  I had just enough time to do an initial workup and staple it to the back of my dissertation as an appendix so that it would at least exist somewhere if I never got back around to it.

Fortunately I have a couple of frightening manuscript deadlines coming up, and there is no better way to deal with an impending deadline than to ignore it completely and find something else to do.  In this case reforming the half-baked appendix into a proper manuscript (and then writing a blog article about it) was the perfect avoidance mechanism (tomorrow I’m on it).

With the back story complete, what did we find in the Antarctic frost flowers?  Definitely not what we initially expected.  We found very little evidence of marine bacteria being deposited and preserved in either glacial ice or snow (this doesn’t mean they aren’t transported there, they probably just don’t survive the trip or last long when they get there).  We did, however, find a significant number of terrestrial bacteria in frost flowers samples, particularly cyanobacteria of the genus Pseudanabaena (commonly found in melt pools on the surface of glaciers) and an add assortment of non-marine sulfur oxidizers.  We think that all of these are the result of wind-driven transport from land the the ice surface, with the latter coming from the sulfur-rich environs around volcanic Mt. Erebus.  What all this means for the microbial ecology of the sea ice surface is not clear.  Armed only with 16S rRNA gene sequences we can’t do much more than speculate.  Further work will have to be done to see if these bacteria are doing anything of interest at the ice surface.

It is worth noting that we have now published community structure data from frost flowers from three different environments, and that the story behind each is very different.  In our earliest efforts at Barrow we collected frost flowers from a highly productive coastal system and found a strange assortment of putatively marine Rhizobiales.  We went to some pretty great lengths in a 2014 paper to demonstrate that these don’t look like terrestrial Rhizobiales deposited by wind, and at any rate the Barrow coastline was pretty well covered with snow when we were there.  In later efforts at Daneborg in Greenland, Jody Deming and a team of collaborators collected frost flowers from a highly oligotrophic fjord.  We found that the microbial community in these looked pretty much like the community in seawater.  Last, we know have these odd frost flowers from off Ross Island in Antarctica, which appear to contain some very interesting bacteria from various adjacent environments.  So the overall story seems to be that the sea ice surface – a warmer and more chemically active environment than one might think – can harbor a diverse array of bacteria.  Which bacteria and from where depends heavily on the dynamics of the surrounding area, including what’s happening in the water when the ice forms, the extent of snow cover, wind magnitude and direction, and probably some things we don’t know about yet!

For my postdoc I’m firmly out of young sea ice and into the water column, looking at microbial processes around the West Antarctic Peninsula.  Someday, however, it will be good to get back around to young sea ice.  If published this will be our fifth paper on the subject and I feel like we hardly have any answers!

Posted in Research | Leave a comment

Some climate fun with Google Earth

My postdoctoral adviser at Lamont showed me something really cool today that I suspect is pretty common knowledge (it was new to me!).  Google Earth has a feature that allows you to view historic aerial/satellite photos for any point on Earth, or at least for any place that was interesting enough to someone in the past that they took a photo of it.

Fortunately one of those places is Palmer Station, Antarctica, home to a small US research base and a really large glacier.  There are photos of the site available from 1963, 1975, 2004, 2008, and 2013.  What is as remarkable as it is expected is the dramatic retreat of the glacier.  Using nothing other than the ruler tool in Glacier Earth anyone can work up a neat little dataset of glacial retreat.  Here are the photos from each year, in order:

1963 1975 2004 2008 2013

That’s a pretty dramatic change.  The yellow line is my ruler, I tried to measure along the same aspect to a well defined point along the glacier’s tongue each year.  Obviously it’s a little bit subjective.  We can take this a step further and plot out the values:

distance

Never mind the Excel plot. It’s too late for R.  The big data gap between 1975 and 2004 is unfortunate; there’s a photo for 1999 but it’s not useable.  The data shows that the glacier has retreated nearly half a kilometer from Palmer Station in 50 years.  That’s not a particularly dramatic glacial retreat relative to some but still impressive to see.  More importantly the rate of retreat is increasing, up to around 15 m per year from 5 m per year at the start of the record.

Of course all of this has implications for the ecosystem around Palmer Station.  The Palmer Long Term Ecological Research project has documented a big increase in the proportion of glacial runoff in coastal seawater along the West Antarctic Peninsula.  This increase is linked to changes in phytoplankton community structure, the base of a short and tightly coupled foodweb.  You can read more about that here.

*** UPDATE ***

Hugh Ducklow, my adviser here at Lamont, gave me permission to post the following images and further explanation.  He’d found the images after plotting, in Google Earth, the locations of some temperature data recorders placed in the soil in a line between Palmer Station and the glacier in the Austral summer of 2014.  The location of the temperature probes can be seen in the following images.  It is, of course, important to remember that the glacier retreats in fits and starts (something that can be seen in the data above).  Years of stability can be followed by catastrophic melt, and the other way around.  Here are the years 1975, 2004, and 2013, with the location of the loggers superimposed.

glacier1 glacier3
glacier2

 

Posted in Research | Leave a comment

Dimension reduction with PCA

I recently had an interesting problem that I think I successfully solved using principal component analysis (PCA).  PCA is one of a family of multivariate statistical methods known as ordinations.  Frankly it, and related techniques, scare me a little.  They’re just a little too  easy to implement in your favorite statistical workspace without any idea how to interpret the output.  Ideally one would spend some time under the hood, getting comfortable with the inner workings of the method before applying it.  I’m not the best at symbolic reasoning, so in these situations I usually rely on some application examples to help me make the jump from the stats book to my analysis.  Often, however, I’m trying to do something a little bizarre for which I can’t find a great guiding example.  That was how my week started, but I think it ended up in a pretty good place.

My problem breaks down like this.  I’ve got 2,700 observations (genomes) described by 3.2 million variables (never mind what the variables are at the moment – could be anything).  I want to calculate the Bray-Curtis distance between these genomes.  I’ve got 32 Gb of RAM, performing distance matrix calculations on a 9 billion cell matrix isn’t going to happen (I tried anyway, using R and in Python using pdist from scipy.spatial, fail).  But what if I don’t need all 3.2 million variables?  It’s reasonable to assume that many, probably the vast majority, of these variables aren’t contributing much to the variance between genomes.  If I can get rid of these superfluous variables I’ll have a matrix small enough to perform the distance calculations.

Enter PCA.  PCA essentially generates artificial variables (axis – or principal components) to describe the variance between samples.  The original variables can be described in terms of their magnitude, or contribution to each principal component.  The principal components are ordered by how much variance they account for.  In ecology you frequently see plots of the first two principal components (the two that account for the largest variance), with arrows depicting vectors that describe the magnitude of the true variables along both principal components.

In PCA you end up with as many PCs as you had samples to start with.  Some of those principal components aren’t describing much in the way of variance.  You can evaluate this with a scree plot.  Here’s a scree plot for the 200 PCs in my analysis:

skree1

There are two inflection points on this plot.  Normally you would be able to ignore all but the first few principal components (i.e., before the first inflection point), but in this case that isn’t accounting for much variance – maybe 30 %.  I don’t think there’s any harm done by retaining additional PCs, other than a larger matrix size, so I chose to use all PCs to the left of the second inflection point, marked with the vertical line.  That accounts for about 90 % of the variance.  Note that this hasn’t done anything to reduce the number of variables in the original analysis, however.

For that I needed to evaluate a second scree plot, of the sum of the magnitudes for each variable, for the PCs that I elected to keep.  In my case the matrix columns are PCs, the rows are variables, so I’m simply summing across the rows.  Before doing this I normalize each magnitude by multiplying it by the proportion of variance accounted for by the relevant principal component (note that magnitude = absolute value of the variable).  Here’s the second scree plot:

scree2

Yes, this is a messy plot, it’s getting very late on a Friday evening…

 

Again, there’s an obvious inflection point marked by the vertical line.  It suggests that there are 100,000 variables that are most influential on the PCs that account for most of the variability.  In short, I can use these 100,000 variables and ditch the other 3.1 million.  Are distance matrices calculated from both the full and partial set of variables comparable?  As the plot below shows, they aren’t quite linearly correlated, but there is very good agreement between the two calculations (R = 0.97).

scatter

One final note on this. I ran the PCA and distance matrix calculations in both R and Python.  In Python I used scipy.spatial.pdist for the distance matrix and matplotlib.mlab.PCA for PCA.  In R I used prcomp and vegdist.  Anecdotally the memory utilization seemed to be quite a bit lower with Python, though I wasn’t watching it that closely.

Posted in Research | Leave a comment

Sea ice biogeochemistry methods review

I was excited to learn that our long awaited sea ice biogeochemistry review was published today in the journal Elementa.  Many of the contributors to the project are members of the SCOR working group Biogeochemical Exchange Processes at Sea Ice Interfaces (BEPSII).  The review anchors a special edition of Elementa, titled after the working group and slated to appear later this year.

The topics covered by the review reflect the authors list, that is, it’s a pretty mixed bag.  There’s something in there for just about everyone; topics range from eddy covariance measurements above sea ice, to standard T/S measurements in sea ice, to nucleic acid extraction from sea ice.  The latter bit was my contribution and meshes well with a review I’m authoring for the Elementa special edition on the link between the taxonomy of the sea ice microbial community and likely biogeochemistry-influencing metabolisms in sea ice.  When putting the section together I was actually surprised at how few studies have undertaken molecular analyses of the sea ice microbial community.  I’ve been using molecular methods on sea ice since 2009, and have cited these various studies extensively, but I never actually sat down and listed them out before.  On one hand its nice to know that so much remains unknown, on the other hand its a little alarming that, given the advanced state of the field in other environments, molecular biology has been so slow to gain traction on sea ice (no pun intended – sea ice isn’t really that slick).

At any rate, thanks to lead author Lisa Miller for pushing this behemoth through and thanks to the BEPSII working group for letting a junior scientist(s) take a role!

Posted in Research | Leave a comment