DHPS: Taking a dip in the DOC pool

An interesting paper was recently published by Durham et al. in PNAS: Cryptic carbon and sulfur cycling between surface ocean plankton.  The all-star author list includes members of the Armbrust Lab at the University of Washington, the Moran Lab at the University of Georgia, Athens, and the Van Mooy Lab at WHOI.

For a long time a compound known as dimethylsulfoniopropionate (DMSP) has been recognized as a major component of both the reduced sulfur and dissolved organic carbon (DOC) pools in the ocean.  As described here, DMSP is produced by phytoplankton in response to various environmental stresses, and is readily consumed by many marine bacteria (some bacteria produce DMS, a climatically active volatile, as a byproduct of this consumption).  In this way it is an excellent example of the microbial loop; dissolved DMSP cannot be directly consumed by zooplankton, but it can be consumed by higher trophic levels after repackaging in microbial biomass.  Presumably, however, DMSP is only one of many important components of the DOC pool.  Determining the other major components turns out to be a rather difficult task.  There are two ways a researcher might go about this.

Figure from Durham et al. 2014, showing the pathway for the intracellular catabolism of DHPS by R. pomeroyi, ending with the excretion of bisulfite.

First, a chemist might try to analyze the different compounds in seawater and determine, structurally, which are likely to be consumed by marine bacteria.  The second bit is relatively easy, but isolating any compound from the salty, organically complex milieu that is seawater is not.  Only through the application of expensive and time consuming methods can a targeted compound be isolated, and it is necessary to know in advance what the target compound is.

Second, a biologist might try to analyze the genetic composition or gene expression pattern of marine bacteria to determine what enzymes the bacteria are using, or have the capacity to use.  An abundance of transporters specific to a certain compound, or class of compounds, for example, suggests that this compound is, for that bacterium, an important substrate.  Unfortunately, determining which enzymes are specific to which compounds is almost as difficult as the bulk analysis of carbon compounds in seawater.  Most of what we know about microbial enzyme specificity comes from studies using E. coli, which may or may not have much to do with what bacteria are doing in the marine environment.  To be absolutely certain about the role of any enzyme it is necessary to perform costly and tedious experiments with a laboratory isolate, a requirement that eliminates the vast, unculturable majority of marine bacteria from any potential analysis.

Durham et al. waded into this particular quagmire with both approaches.  Working with a model system composed of one diatom, Thalassiosira pseudonana, and one bacterium, Ruegeria pomeroyi, they compared the gene expression profiles of bacteria grown in the presence of the diatom with profiles from pure cultures.  Surprisingly, genes strongly upregulated in the presence of the diatom included transporters for the reduced organosulfur compound 2,3-dihydroxypropane-1-sulfonate (DHPS), a C3 sulfonate, and genes coding for dehydrogenase enzymes involved in DHPS catabolism, including the gene hpsN.  Taking this hint, the researchers were able to grow R. pomeroyi on DHPS as the sole carbon source, observing a gene expression profile similar to that seen when R. pomeroyi was grown in the presence of the diatom.  Because DHPS was not a component of the medium used in the initial experiment it must have been produced by the diatom, and clearly it is an acceptable growth substrate for R. pomeroyi.  But how important is it in the natural environment?

With a target compound and a target gene in hand, Durham et al. were able to quantify DHPS in seawater during a cruise in the North Pacific, while simultaneously collecting metatranscriptomes and metagenomes.  They observed an abundance of hpsN transcripts coincident with diatom cell counts and high DHPS concentrations.  At coastal stations the concentration of DHPS actually exceeded that of DMSP.  This is a great paper, but it is just a first look at a story that is sure to have geographic and taxonomic complexity.  I’m sure we’ll be hearing a lot more about DHPS (and other, yet to be revealed components of the DOC pool) in the coming years.

Posted in Research | Leave a comment

Paper published on psychrophile alkane hydroxylases

We just published a paper in BMC Genomics on the occurrence of genes coding for alkane hydroxylases in the genomes of psychrophilic (cold loving) bacteria.  This paper does a couple of interesting things and fails to do several other interesting things.  A little background before we get into that.

For several decades investigators have been interested in the ability of psychrophilic microbial communities to degrade the various components of crude oil, a process known as bioremediation.  Because temperature has a direct impact on enzyme activity there is a concern that, at very low temperatures, the microbial community might be inhibited in its ability to undertake bioremediation.  The benefits of bioremediation were seen in the response of the microbial community to the infamous Deepwater Horizon event in the Gulf of Mexico.  Indigenous bacteria deep in the water column consumed much of the escaping oil before it had the opportunity to reach the surface, lowering (but not eliminating) the ecological impact of that event.

Despite the subtropical location of the Deepwater Horizon, the water temperature where much of this bioremediation took place was only 5 degrees C.  This is not much warmer than summertime Arctic surface waters, and close to the point at which temperature is observed to start inhibiting microbial activity (see figure).  What would happen if an Arctic drill rig had a similar accident?  Would the microbial community be able to respond as effectively?

Taken from Kirchman et al. 2009. Bacterial growth rate, here a useful proxy for the activity of any system of bacterial enzymes, doesn’t change much with temperature until about 4 degrees C.  Below 4 degrees C enzyme activity seems to be inhibited most of the time.  The devil is in the details and we don’t understand the exceptions.  Whether a marine microbial community in -1 degree C water could respond to a crude oil spill as effectively as a community at 5 degrees C is an open question.

To shed a small bit of light on this question we compared a population of psychrophile genomes to a taxonomically similar population of mesophiles (bacteria that live at approximately room temperature), searching both populations for genes coding for alkane hydroxylases.  Alkane hydroxylases are enzymes involved in the degradation of alkanes, a major component of crude oil.  We used a novel strategy to find putative alkane hydroxylase genes in these genomes.  First, we used the excellent hmmscan tool in HMMER 3.0 to identify gene products containing conserved domains that are also found in known alkane hydroxylases.  Then we used non-metric multidimensional scaling (NMDS) to find clusters of similar sequences (see figure).

Taken from Bowman and Deming, 2014. Each point in this NMDS plot of the FA_desaturase protein family is a protein from either the PFAM database (orange), the Uniprot database (black), a mesophile (red), or a psychrophile (blue). The distance between any two points on the plot is related to their sequence similarity. The Uniprot proteins are known alkane hydroxylases. Thus any red and blue points that fall close to black points are hypothetical alkane hydroxylases. You: What’s close? Me: How blue is the ocean? These questions are not without merit but we will not answer them today. The PFAM proteins are included only to provide “landscape” (proteinscape?), which helps to visualize clusters of proteins.
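
For anyone who wants to try something similar, here is a minimal sketch of the clustering step in Python, assuming you have already pulled out the proteins that hmmscan flags as containing the relevant domains and computed a square matrix of pairwise distances between them (for example, one minus fractional identity from an all-against-all alignment).  The file names are placeholders, and this is not the exact pipeline from the paper, just the general idea.

## a minimal sketch: ordinate proteins by non-metric multidimensional scaling
## of a precomputed pairwise distance matrix

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

## dist.csv is a hypothetical square, symmetric matrix of pairwise distances
dist = np.loadtxt('dist.csv', delimiter = ',')

nmds = MDS(n_components = 2, metric = False, dissimilarity = 'precomputed', random_state = 1)
coords = nmds.fit_transform(dist)

## each point is one protein; points that land close together have similar sequences
plt.scatter(coords[:, 0], coords[:, 1], s = 10)
plt.xlabel('NMDS1')
plt.ylabel('NMDS2')
plt.savefig('nmds_proteins.png')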

Surprisingly, we found many putative alkane hydroxylase genes in both mesophile and psychrophile genomes that were not annotated as alkane hydroxylases.  Typically these genes were annotated simply as "hydroxylase", or something similarly nondescript.  But here’s where our analysis falls a bit short.  The critical next step is to synthesize these putative alkane hydroxylase genes, clone them into a vector, and express them in a model bacterium (probably E. coli, as I’m not aware of a transformable model psychrophile; anyone know one?) to see if they are what we think they are.  Unfortunately that’s a whole project by itself, and one can be a PhD student for only so long (I didn’t start this project until pretty late in the game).  It’s on the to-do list and, in the meantime, perhaps another lab will pick it up and run with it…

Many thanks to the EPA STAR (Science To Achieve Results) Fellowship program for funding me while I worked on this and related projects.  The EPA STAR program is sadly now defunct, a victim of congressional cost-cutting.  As I understand it the program was perceived as duplicative of the larger NSF GRFP (Graduate Research Fellowship Program).  Setting aside for a moment the vastly different missions of the EPA and NSF (the former being mission driven and the latter supporting basic research), suggesting that the programs are redundant is a bit like saying a thirsty person doesn’t need a second sip of water, because it duplicates their first.  The NSF GRFP funds a very small fraction of the graduate researchers who are in need of support.

Posted in Research | Leave a comment

DIAMOND – A game changer?

Figure from Buchfink et al. 2014. The figure shows the speedup of DIAMOND over BLASTX and over an alternative new-generation aligner known as RAPSearch2. Metagenomic datasets from the Human Microbiome Project were aligned against the NCBI nr database.

A very exciting paper was published in Nature Methods a couple weeks ago by Benjamin Buchfink and colleagues: Fast and Sensitive Protein Alignment Using DIAMOND.  The paper debuts the DIAMOND software, touted as a much-needed replacement for BLASTX.  BLASTX has been a bioinformatics workhorse for many years and is (was) the best method to match a DNA sequence against a protein database.  BLASTX worked well in the era of Sanger sequencing.  With routine sequence runs today producing many gigabytes of data (hundreds of millions of reads), however, BLASTX is woefully inadequate.  I once attempted to annotate a relatively small metagenome using BLASTX against the NCBI nr database on a high performance cluster of some size and couldn’t do it.  It was theoretically possible, but the resources required were out of alignment with what I stood to gain from a complete annotation.  I pulled the plug on the experiment at around 25 % complete to avoid getting blacklisted.

There are faster alignment (i.e. search) algorithms available, for example the excellent BWA DNA-DNA aligner, but they don’t quite do what BLAST does.  BLAST is valuable for the same reason that it is slow: it’s sensitive.  Another way of thinking about this is that BLAST has good methods for aligning dissimilar sequences, and can thus return a more dissimilar (but still desired) match.  Many sequences queried against, for example, the NCBI nr database don’t have a close match in that database.  The more dissimilar the query is from a candidate homologue, the more calculations need to be performed for the aligner to propose a biologically plausible alignment.  This is bad enough when aligning two DNA sequences.  Because the amino acid alphabet is larger than the DNA alphabet, however, comparing protein sequences is even more computationally intensive.

The developers of DIAMOND used improved algorithms and additional heuristics to build a better aligner.  I’m not going to attempt a detailed description of the algorithms – which I would certainly botch – but the paper refers to three modifications to the BLAST concept that result in a huge speedup for DIAMOND.  First, DIAMOND uses optimized spaced seeds (short subsets of the query and reference sequences, described by a seed weight and shape) to find initial matches.  Second, the aligner employs something called double indexing, in which the seeds of both the query and the reference sequences are indexed and traversed together.  Finally, the aligner relies on a reduced amino acid alphabet, in which the 20 amino acids are collapsed into 11 groups.
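
To give a feel for the reduced alphabet idea, here is a toy illustration in Python.  The 11 groups below are one plausible reduction (conservative substitutions collapse to a single symbol, so they no longer break exact seed matches); check the paper for the grouping DIAMOND actually uses.

## illustration only: collapse a protein sequence into a reduced alphabet so that
## conservative substitutions (e.g. I, L, and V) map to the same symbol

GROUPS = ['KREDQN', 'C', 'G', 'H', 'ILV', 'M', 'F', 'Y', 'W', 'P', 'STA']
REDUCE = {aa: group[0] for group in GROUPS for aa in group}

def reduce_alphabet(seq):
    ## map each residue to the first letter of its group, or 'X' if unrecognized
    return ''.join(REDUCE.get(aa, 'X') for aa in seq.upper())

print(reduce_alphabet('MKTILVAD'))  ## 'MKSIIISK'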

So how fast is DIAMOND?  Really, really fast.  The paper describes some basic benchmarks.  I tried it out on a 12 core Linux box.  I did not assess accuracy, but for a basic search it was everything it promised to be and easy to use (what is bioinformatics coming to?).  The pre-compiled Unix executable worked straight out of the box and the DIAMOND developers have kindly copied the BLASTX command structure.  To try it out I aligned the metagenome described above against the UniRef90 database as such:

## build the DIAMOND database, then time the alignment with date stamps
diamond makedb --in uniref90.fasta -d uniref90 &&
date > start_stop.txt
diamond blastx -d uniref90 -q 44_trim.mates.fasta -o diamond_test.txt
date >> start_stop.txt

The metagenome contains just over 12 million Illumina sequence reads and this (slightly old) version of UniRef90 contains just over 15 million sequences.  That’s a lot smaller than nr, but still pretty big.  The DIAMOND default is to use the maximum number of cores available – on this hyperthreaded system it recognized 24.  The whole alignment took only 17 minutes and never used more than 16 GB of memory.  This is such a large speedup over BLASTX that I’m having a hard time wrapping my head around it.  There’s no way that I’m aware of to estimate how long it would take to execute a similar BLASTX search, but I think it would be weeks.  It’s hard to convey how exciting this is.  DIAMOND may have just eliminated one of the biggest analytical bottlenecks for environmental sequence analysis (to be replaced by a larger one, I’m sure).

Posted in Research | 4 Comments

Basic computers for bioinformatics, revisited

I recently started a postdoc in a lab without any real computer infrastructure so I was in need of a system with a little more juice than my laptop.  While a graduate student at the UW I did everything except for the lightest-weight bioinformatics on either a MacPro workstation (dating from 2008 but running strong) or the Hyak high-performance cluster.  For the next two years I don’t anticipate needing anything like Hyak (though it was always very reassuring to know it was there, if I needed it, which I did far more than I ever anticipated).  My immediate concern was to find a replacement for the MacPro.

In a previous post I suggested that bioinformaticians, as a group, might be a little Mac-happy.  Even if I wanted one, however, I couldn’t afford a high end MacPro on my budget.  My two year fellowship came with $5,000 in research funds, so I needed to find the most computer that I could for < $5K (lab supplies will have to take care of themselves.  With luck my starter population of pipette tips will breed).  After a bit of shopping I ended up purchasing a custom built workstation from Puget Systems, a small company based outside of Seattle.  For just over $4,300 I got a 12 core machine with 32 GB of RAM and 8.5 TB of storage, including a solid state boot drive.  That won’t be enough RAM or storage if all my pending projects bear fruit, but the system can handle 128 GB of RAM and 32 TB of storage if I need to expand.  For the operating system I settled on Ubuntu for ease of use.  The box is purring away now, but there was enough weirdness in the beginning that I’m not sure I’d go with Ubuntu again.  Red Hat might have been a better solution, as I need the workstation configured like a server, not a home computer.

Just for kicks I did a quick cost-comparison against a similar MacPro.  Apple’s gone a little crazy with the current generation of MacPros; I believe you can only get a single processor (up to 12 cores) and, consistent with Apple’s design philosophy, the storage has been moved outside the central box.  I’ve been told that the new configuration vastly improves I/O speeds but I’m not tech savvy enough to know how or why.  At any rate the “equivalent” MacPro comes in at $9,198.00 and, while sleek and sexy (minus the weird external storage module), lacks the expandability of the hulking box in my office.

In case it is of interest to other junior researchers on a budget, here are the parts that went into my box from Puget Systems.  It took them only 3-4 days to build the system and get it out the door.  If you’ve got the money this basic system could be configured into a pretty decent computer – I believe the motherboard can support two 18-core Intel Xeon processors (oh I wish!) and 128 GB of RAM.

Parts for a ~$4,300 workstation for mid-weight bioinformatics.

Posted in Research | Leave a comment

Clustering metagenomic sequence reads

Another interesting paper caught my eye last week, Nielsen et al. in Nature Biotechnology: Identification and assembly of genomes and genetic elements in complex metagenomic samples without using a reference genome. First, a complaint: 53 authors, really?  There are more offensive papers out there in this regard, but come on guys, not everyone in the research group needs to be listed.  I could be totally wrong on this, but it seems unlikely that everyone on the list touched the data or contributed in a way that justifies authorship.  If I’m wrong, my sincere apologies, and more power to them!  If I’m not wrong then it would be nice of the top tier journals, where author bloat is most prevalent, to start policing this a little more aggressively.

Moving beyond that I’ve got warm feelings toward this particular (if very large) list of authors, as Nik Blom and others got me started in biological sequence analysis with a lab rotation at DTU back in 2010 (and are co-authors on one of the papers that came from my dissertation work).  The Nielsen et al. paper tackles one of the more vexing and important questions regarding biological sequence analysis: Given a metagenome, how can the reads be clustered by genome?  Solving this problem allows the metagenome to be more than a tool for evaluating community metabolic potential, as large genomic fragments can be used to probe deeper questions of microbial evolution, diversity, and function.

The problem addressed by the paper isn’t novel; numerous previous papers have reported genomes assembled from metagenomes.  All of these used some read-clustering method to group like reads or like contigs prior to assembly.  A nice example of this is given by Iverson et al., 2012.  They used coverage statistics, GC content, and other metrics to bin contigs, and ultimately tetranucleotide information (4-base kmer abundance) to group like scaffolds into a (nearly) complete genome of an uncultured marine group II Euryarchaeota.  There are now a number of papers out that use a philosophically similar approach, often relying on an emergent self-organizing map (ESOM) to perform the actual clustering.  This approach seems to work reasonably well, but the reads are binned at fairly low resolution, with each cluster potentially containing bits from several similar genomes.  There is no ability to distinguish between ecotypes or related strains.
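
As an aside, the tetranucleotide signal that underpins this kind of binning is easy to compute.  Here is a minimal Python sketch that turns a single contig into a 256-element frequency vector; real pipelines usually also collapse reverse complements and then hand the vectors to a clustering method such as an ESOM.

## count all 4-mers in a contig and normalize to frequencies

from collections import Counter
from itertools import product

KMERS = [''.join(p) for p in product('ACGT', repeat = 4)]  ## all 256 tetranucleotides

def tetra_freqs(seq):
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS) or 1  ## kmers containing N or other characters are ignored
    return [counts[k] / float(total) for k in KMERS]

print(sum(tetra_freqs('ACGTACGTACGTACGTACGT')))  ## ~1.0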

Nielsen et al. improved upon these methods by clustering genes based on their co-abundance across samples; they call the resulting clusters co-abundance gene groups (CAGs).  Applying this method to 396 human gut metagenomes, they assembled 238 unique genomes.  If the method survives the scrutiny of the community it represents a big step forward in throughput.  Here’s what they did in a little more detail.

The authors started by conducting a de novo assembly of the metagenomes and calling open reading frames (ORFs).  Because ORFs are short this is a relatively simple task.  Picking an ORF at random, they then searched the dataset for ORFs with a similar abundance profile across all samples (having a large number of samples is essential for this approach).  ORFs with similar abundance profiles were considered to be from the same genetic element and were grouped into clusters called canopies.  Canopies of very rare or poorly distributed ORFs were rejected; the remaining canopies were considered to represent genomes and plasmids, as illustrated by the bimodal canopy size distribution in their Fig. 2a.

Taking these canopies as probable genetic entities, the authors then mapped the original sequence reads to each canopy and assembled each pool of reads.  Overall a pretty slick method, assuming one has 396 somewhat-similar samples to analyze!  I’m curious to know whether a single deeply sequenced sample could be randomly partitioned into virtual samples for a similar analysis.  This would make the method available to us mere mortals without 396 metagenomes at our disposal…
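
To make the co-abundance idea concrete, here is a toy sketch in Python: pick a seed ORF at random and pull in every other ORF whose abundance profile across samples is tightly correlated with it.  The correlation threshold and the fake data are mine; the actual clustering rules in Nielsen et al. are more involved.

## toy canopy clustering by co-abundance across samples

import numpy as np

np.random.seed(1)

def make_canopies(abund, r_min = 0.9):
    ## abund is an ORFs x samples abundance matrix; returns lists of row indices
    unassigned = set(range(abund.shape[0]))
    canopies = []
    while unassigned:
        seed = np.random.choice(list(unassigned))
        r = np.array([np.corrcoef(abund[seed], abund[i])[0, 1] for i in range(abund.shape[0])])
        members = [i for i in unassigned if r[i] >= r_min]
        canopies.append(members)
        unassigned -= set(members)
    return canopies

## two fake "genomes" of 50 ORFs each, with correlated abundances across 20 samples
base1, base2 = np.random.rand(20), np.random.rand(20)
abund = np.vstack([base1 * np.random.uniform(0.5, 2, (50, 1)),
                   base2 * np.random.uniform(0.5, 2, (50, 1))])
abund = abund + np.random.normal(0, 0.01, abund.shape)
print([len(c) for c in make_canopies(abund)])  ## expect roughly two canopies of ~50 ORFs each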


Posted in Research | Leave a comment

Great modeling paper published

A very nice paper in the ISME Journal came across my Google Scholar alerts this week – Satellite remote sensing data can be used to model marine microbial metabolite turnover, by Larsen et al.  The author list includes some heavy hitters in the field of microbial ecology, including Rob Knight and Jack Gilbert.  The paper is significant for being, as far as I’m aware, the first study to quantitatively link microbial taxonomy, metabolic potential, and environmental parameters in a predictive manner.  This is something of a holy grail for microbial ecologists because, while microbial taxonomy and metabolic potential are difficult to measure, and can only be measured at discrete times and places in a study region, some environmental parameters (chlorophyll, sea surface temperature, etc.) are easily and near-continuously measured by satellite across a broad study region.  Robustly correlating community data to these easily observed parameters allows the prediction of microbial community composition, and the resulting metabolome (the pool of material originating from the microbial community), from some future set of environmental parameters.  Here’s a quick summary of what they did.

Focusing on the Western English Channel, an ecologically very well characterized site, the authors developed spatially contiguous data for dissolved oxygen, phosphate, nitrate, ammonium, silicate, chlorophyll a, photosynthetically active radiation, particulate organic carbon in small, medium, large, and semi-labile classes, and bacterial abundance.  Most of these data are not observable via satellite, but all of them are much easier to collect than direct measurements of microbial community structure or metabolic potential.  It’s my understanding that the authors first correlated the parameters that could not be measured by satellite to those that could, and then used these models to construct the contiguous datasets.  They then used a microbial assemblage prediction (MAP) model (see Larsen et al., 2012) to, well, predict the microbial assemblage.  If I understand correctly this also works off correlations, via a neural network algorithm that identifies linkages between different taxa and environmental parameters (previously measured at the same time and place in the environment).  The next step is to predict the metabolic potential (genetic content) of the microbial community.  The authors do this at the order level to sidestep some of the issues with genetic plasticity at finer taxonomic scales.  By taking all the available genomes for each predicted order, they can estimate its genetic content.  This is pretty hand-wavy, as many functions are not shared between all members of an order, but it’s a good place to start.
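
It is easier to see the shape of this approach with a toy example.  The sketch below is emphatically not the authors’ MAP model; it just shows the general idea of training a regressor on stations where both the environmental parameters and the community composition were measured, and then predicting composition wherever only the environmental data exist.  The file names are placeholders.

## not the authors' model, just the flavor of the approach

import numpy as np
from sklearn.ensemble import RandomForestRegressor

## hypothetical training data, one row per station
env_train = np.loadtxt('env_train.csv', delimiter = ',')    ## stations x environmental parameters
taxa_train = np.loadtxt('taxa_train.csv', delimiter = ',')  ## stations x taxa (relative abundance)

model = RandomForestRegressor(n_estimators = 500, random_state = 1)
model.fit(env_train, taxa_train)

## predict the assemblage across a grid of (e.g. satellite derived) parameters,
## then renormalize so each predicted community sums to one
env_grid = np.loadtxt('env_grid.csv', delimiter = ',')
taxa_pred = model.predict(env_grid)
taxa_pred = taxa_pred / taxa_pred.sum(axis = 1, keepdims = True)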

The next step is where the rubber meets the road.  The authors attempt to connect metabolic potential to the community metabolome.  They do this by using the KEGG database to identify reaction products associated with the enzymes comprising the community metabolic potential.  To validate this approach the authors focus on the one metabolite for which there is abundant data: CO2.

I’m pretty excited about what they’ve done; it’s a great start to elucidating how microbial community structure interacts with the environment and vice versa.  There are, however, a number of caveats to this analysis.  First, as touched on earlier, there is a huge issue with genomic plasticity in both prokaryotic and eukaryotic marine microbes.  Genomes from very closely related taxa, even from the same species, can differ by 40 % or more.  That is a lot of metabolic function that is present in one cell but not the other.  Another issue is phenotypic plasticity, which is the ability of a cell to run different pieces of "software" on its genetic "hardware".  By expressing different combinations of genes, for example, a cell can achieve a much higher level of phenotypic diversity than one might expect from looking just at the contents of its genome.  As we are all well aware, software is easily broken or lost.  Thus even if genomic plasticity is at a minimum for a certain taxon, there is no guarantee that all its members will respond in a like manner to the same set of environmental conditions.  Even clonal cells are often observed to drift apart phenotypically long before their genomes diverge.  It will take a lot more phenotype-aware data collection (e.g. transcriptomics, proteomics, and metabolomics) to sort out the true impact of community structure on the environment.

Posted in Research | Leave a comment

Making maps in R

I don’t have a lot of experience with geospatial data analysis but following a recent cruise I had the need to plot some geospatial data.  The go-to program for this seems to be Matlab (or even ArcGIS, for those who are serious about their map making), but I do almost all of my plotting in R.  This sent me on a bit of an exploration of the various map making options for R.  Although the maturity of the various R geospatial libraries is far below that of commercial products, it’s possible to make some pretty decent plots.  Here’s a synopsis of the method I ended up using.

To plot data on a map you first need a map to work from.  A map comes in the form of a shape file, which is nothing more than a collection of lines or shapes describing the outline of various map features.  Shape files come in various flavors, including lines and polygons.  For reasons that will become clear, working with polygons is much preferable to working with lines, but I found it difficult (impossible) to consistently get the right polygon format into R.  R does have some built in polygons in the maps and mapdata packages.  Using these it is possible to access a global polygon map, however the resolution of the map is quite low.  The documentation doesn’t provide a maximum resolution, but it is low enough to be of little value on scales below hundreds of miles.  Here’s a quick plot of our study area (the Antarctic Peninsula) with no further information plotted, just to get a sense of the resolution.

library(maps)
library(mapdata)
map(database = 'world',
    regions = "antarctica",
    xlim = c(-77, -55),
    ylim = c(-70, -60),
    fill = T,
    col = 'grey',
    resolution = 0,
    bg = 'white',
    mar = c(1,1,2,1)
)

box()

The resulting plot of the study area from the built in world database.

Not too bad, but the y dimension is over 1000 km.  If our map covered 10 km or even 100 km we’d have at best a large block to define any landmass.  To produce a more detailed map we need to use a shape file with a much higher resolution.  This is where things get a little tricky.  There are high resolution shape files available from a number of sources, primarily government agencies, but finding them can be difficult.  After a lot of searching I discovered the GSHHG High-resolution Geography Database available from NOAA.  The database is quite large, but it can be subset with the GEODAS Coastline Extractor, which also makes use of the WDBII database (included with the GSHHG link) containing rivers and political boundaries.  It takes a little fussing around, but once you get the hang of the coastline extractor it isn’t too bad to work with.  On first use you have to point the extractor to the databases.  Then select "plot" from the drop down menu and enter the desired lat and long range.  This returns a nice graphic of the selected region, which you can export in the desired shape file format using "Save as".

The export step is where I’m a bit stuck.  Ideally you would export the coastline as a polygon so that R could color it, if desired (as in the previous example).  I can only get R to recognize shape files containing lines however, exported in the ESRI format.  Here’s the region plotted earlier but now from the GSHHG database:

library(maps)
library(mapdata)
library(maptools)
## same region as the first plot
xlim <- c(-77, -55)
ylim <- c(-70, -60)

plter_shape <- readShapeLines('plter_region_shore_shore',
                              repair = T)

## set up an empty projected plot; the coastline is added from the shape file below
map("world",
    "antarctica",
    xlim = xlim,
    ylim = ylim,
    fill = T,
    col = 'grey',
    resolution = 0,
    bg = 'white',
    mar = c(6,3,2,1),
    type = 'n'
)

plot(plter_shape, add = T)
box()

Which produces this plot:

The same region, this time plotted from the GSHHG shore line data.

That’s quite a bit more detail.  If we zoomed in the improvement would be even more pronounced.  You’ll notice that I plotted the ESRI shape file on a plot produced from maps, even though I had no intention of using maps to reproduce the coastline.  Using maps allows R to plot the data as a projection, which is particularly important at high latitudes.  The default maps projection isn’t great for 60 degrees south (see the maps documentation for a list of available projections), but it gets the point across.  A square projection would be much worse.  What’s particularly nice about all this is that R will use the projected coordinates for all future calls to this plot.  Points or lines added to the plot will thus show up in the right place.

Now that we have our basic map we can add some data.  One of our science tasks was to evaluate bacterial production along the Palmer Long Term Ecological Research (LTER) sampling grid.  Plotting depth-integrated values on our map lets us see where bacterial production was high.  I’m not going to bother showing how I arrived at the depth integrated values because the code is too specific to my data format, but suffice to say that I used the interp1 function to interpolate between depth points at each station, then summed the interpolated values.  This gives me a data frame (depth_int) with three variables: lon, lat, and bp (bacterial production).

First, let’s generate some colors to use for plotting:

z <- depth_int$bp
zcol <- colorRampPalette(c('white', 'blue', 'red'))(100)[as.numeric(cut(z, breaks = 100))]

Then we add the points to the plot.  This is no different than building any other R plot:

points(depth_int$lon,
       depth_int$lat,
       bg = zcol,
       pch = 21,
       cex = 2,
       col = 'grey'
)

Snazz it up a little by adding grid lines and axis labels:

axis(1, labels = T)
axis(2, labels = T)
grid()

Titles…

plot_title <- expression(paste('Bacterial production gC m'^{-2}, ' d'^{-1}, sep = " "))
title(xlab = 'Longitude',
      ylab = 'Latitude',
      main = plot_title)  ## pass the expression itself, not the string 'title'

And the coup de grâce, a scale bar.  I can’t find a good scale bar function, but it’s easy enough to build one from scratch:

## set the location and the colorbar gradation
xleft <- -74
xright <- -73
ybot <- -65
yint <- (65 - 62.5) / 100
ytop <- ybot + yint

## create the bar by stacking a bunch of colored rectangles, drawing each
## rectangle before stepping up to the next interval
for(c in colorRampPalette(c('white', 'blue', 'red'))(100)){
  rect(xleft, ybot, xright, ytop, border = NA, col = c)
  ybot = ybot + yint
  ytop = ytop + yint
}

## generate labels
labels <- round(seq(min(z), max(z), length.out = 5),2)

## add the labels to the plot
text(c(xright + 0.2),
     seq(-65, -62.5, length.out = 5),
     labels = as.character(labels),
     cex = 0.8,
     pos = 4)

Done correctly this should approximate the plot published in my earlier post.  Plenty of tweaks one could make (I’d start by shifting the location of the scale bar and drawing an outlined white box around it), but it’s not a bad start!

The finished map: depth integrated bacterial production across the Palmer LTER grid.

Posted in Research | 3 Comments

Converting pfams to cogs

Anyone with a sequence to classify faces a bewildering array of database choices.  You’ve got your classic NCBI quiver: nt, nr, refseq, cdd, etc.  Then you’ve got uniprot, seed, pfam, and numerous others.  Some of these have obvious advantages or disadvantages.  The NCBI nr database, for example, is comprehensive but cumbersome.  Choosing between others can at times feel arbitrary.

One database that I use a lot is pfam.  I like it because, in addition to being directly supported by the excellent protein classifier HMMER 3.0, it is well curated and has helpful Wikipedia-like articles for some of the protein families.  Classification is based on the presence of conserved domains, which is also a little easier to wrap your head around than the implications of sequence similarity alone.  The downside to pfam is that the presence of a conserved domain doesn’t always clue you in to the function of an unknown protein sequence.  There are also many thousands of protein domains.  Sometimes you can guess at the function from the name.  Aldedh, for example, is the aldehyde dehydrogenase family.  Other names are more obtuse.  Abhydrolase_5 is a family containing the alpha/beta hydrolase fold domain, but unless you’re a biochemist the implications of that are probably a mystery (they were to me; fortunately there are the wiki articles).  What’s needed is a broader level of classification for these protein families.

NCBI used to maintain a really great database called COG (Clusters of Orthologous Groups).  Proteins were grouped into COGs, which were given descriptive names like Glyceraldehyde-3-phosphate dehydrogenase/erythrose-4-phosphate dehydrogenase.  Even better, the COGs were assigned letters based on their perceived physiological function.  Some might argue that this is too coarse a classification, and they might be right, but it is still helpful when evaluating the ecological relevance of a protein.  Here are the COG codes, taken from the fun.txt file at the COG ftp site:

INFORMATION STORAGE AND PROCESSING
[J] Translation, ribosomal structure and biogenesis
[A] RNA processing and modification
[K] Transcription
[L] Replication, recombination and repair
[B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING
[D] Cell cycle control, cell division, chromosome partitioning
[Y] Nuclear structure
[V] Defense mechanisms
[T] Signal transduction mechanisms
[M] Cell wall/membrane/envelope biogenesis
[N] Cell motility
[Z] Cytoskeleton
[W] Extracellular structures
[U] Intracellular trafficking, secretion, and vesicular transport
[O] Posttranslational modification, protein turnover, chaperones

METABOLISM
[C] Energy production and conversion
[G] Carbohydrate transport and metabolism
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED
[R] General function prediction only
[S] Function unknown

I decided to try to map the pfam database to the COG database so that I could have my cake and eat it too.  To do that I ran blastp with the Pfam-A.fasta file as the query against a protein blast database created from cog0303.fa, and used the following script to parse the tab-delimited blast output and various other files to create the mapping.  This was a reasonably large blast job and I ran into a computer issue part way through.  As a result the job didn’t complete.  When I have the chance to re-run it I’ll post the completed pfam to cog table here.  In the meantime here’s the script for anyone who would like to create their own pfam to cog map.  It is assumed that all database files are in the working directory.

## first step was to blast pfam-A against cog0303.fa
## blastp -db COG/cog0303 -query Pfam-A.fasta -out Pfam-A_cog.txt -outfmt 6 -num_threads 8 -evalue 1e-5
## if you want to conduct a new mapping, for example from a new blast file, remove the old pickle first

import gzip
import cPickle
import os

cog_cats = {}
cogs_seqs = {}
cog_names = {}
pfam_seqs = {}
pfam_cog = {}

if 'pfam_cog_dict.p' not in os.listdir('.'):

    ## map cog name to cog category
    print 'mapping cog name to cog category'
    with open('cogs.csv', 'r') as cog_file:
        for line in cog_file:
            line = line.rstrip()
            line = line.split(',')
            cog_cats[line[0]] = line[1]
            cog_names[line[0]] = line[2]

    ## map cog sequence to cog name
    print 'mapping cog sequence to cog name'
    with open('domdat.csv', 'r') as cogs_seqs_file:
        for line in cogs_seqs_file:
            line = line.rstrip()
            line = line.split(',')
            cogs_seqs[line[0]] = line[6]

    ## map pfam sequence to pfam
    print 'mapping pfam sequence to pfam'
    with open('Pfam-A.fasta', 'r') as pfam_seqs_file:
        for line in pfam_seqs_file:
            if line.startswith('>'):
                line = line.strip('>')
                line = line.rstrip()
                line = line.split()
                pfam_seqs[line[0]] = line[2]

    ## map blast
    print 'mapping blast'
    with gzip.open('Pfam-A_cog.txt.gz', 'rb') as blast:
        for line in blast:
            line = line.rstrip()
            line = line.split()
            pfam_seq = line[0]
            cog_seq = line[1]
            pfam = pfam_seqs[pfam_seq]
            cog = cogs_seqs[cog_seq]
            try:
                cog_name = cog_names[cog]
            except KeyError:
                cog_name = 'NULL'
            try:
                cog_cat = cog_cats[cog]
            except KeyError:
                cog_cat = 'NULL'
            try:
                temp = pfam_cog[pfam]
                temp_cog = temp[0]
                temp_cog_cat = temp[1]
                temp_cog.add(cog)
                temp_cog_cat.add(cog_cat)
                temp_cog_name = temp[2]
                temp_cog_name.add(cog_name)
                pfam_cog[pfam] = temp_cog, temp_cog_cat, temp_cog_name
            except KeyError:
                temp_cog = set()
                temp_cog_cat = set()
                temp_cog.add(cog)
                temp_cog_cat.add(cog_cat)
                temp_cog_name = set()
                temp_cog_name.add(cog_name)
                pfam_cog[pfam] = temp_cog, temp_cog_cat, temp_cog_name
            print pfam, cog, cog_cat, cog_name

    cPickle.dump(pfam_cog, open('pfam_cog_dict.p', 'wb'))

else:
    print 'using existing pfam to cog map'
    pfam_cog = cPickle.load(open('pfam_cog_dict.p', 'rb'))

    with open('pfam_cog_cat_map.txt', 'w') as output:
        for pfam in pfam_cog.keys():
            temp = pfam_cog[pfam]
            pfam_temp = pfam.split(';')
            pfam_code = pfam_temp[0]
            pfam_code = pfam_code.split('.')
            pfam_code = pfam_code[0]
            for cog_cat in list(temp[1]):
                print >> output, pfam_code+'\t'+pfam_temp[1]+'\t'+cog_cat

This script produces a pickle with the mapping, in case you want to format the output differently later without having to redo everything, and a tab-delimited file with the COG categories that each pfam belongs to:

PF07792     Afi1     NULL
PF05346     DUF747     NULL
PF12695     Abhydrolase_5     E
PF12695     Abhydrolase_5     I
PF12695     Abhydrolase_5     FGR
PF12695     Abhydrolase_5     Q
PF12695     Abhydrolase_5     R
PF12695     Abhydrolase_5     U
PF12695     Abhydrolase_5     NULL
PF12644     DUF3782     S

Note that each pfam can belong to many COGs (maybe too many COGs for the mapping to be useful in some cases).  Hopefully it will help clarify the role of some of these pfams however!

Posted in Research | Leave a comment

Where goes the carbon in the Palmer LTER?

In the last post I broadly described some of the ecological changes underway in the Palmer LTER.  If we consider only the biological components of an ecosystem (excluding the chemical and physical components) we can diagram their interactions as a food web, with primary producers at the base of the web and top predators at the top.  Changes to the structure of the food web are generally driven by changes at the top (‘top-down’ effects) or at the bottom (‘bottom-up’ effects).  Whaling during the 19th and 20th centuries, for example, exerted a very strong top-down effect on the structure of Antarctic food webs.  Rapidly changing sea ice conditions during this century are exerting a strong bottom-up effect by altering the pattern of primary production.

The classic Antarctic food web. The traditional food web is a good learning tool, but is incomplete. In reality a substantial amount of biomass cycles through the marine microbial community. Taken from http://science.jrank.org/kids/pages/63/FITTING-ALL-TOGETHER.html.

The traditional food web that we all learned in elementary school is (not surprisingly) an incomplete representation of the trophic structure of an ecosystem.  If we consider the carbon flowing through such a food web we see it move in only one direction, up from the primary producers to the top consumers.  A rule of thumb is that the biomass of each trophic level is roughly one tenth that of the level below it, so as carbon moves up through the food web most of it must be exiting the system.  Where does it go?  Much of it is lost through respiration – we and nearly all other multicellular organisms exhale carbon dioxide.  The rest is lost through inefficient feeding, excretion, and death in the consumer trophic levels.

A pelagic food web with the microbial loop. The diagram represents an open-ocean environment different from the west Antarctic Peninsula, but does a good job at highlighting the function of the microbial loop (at left). DOC leaks from the food web at every trophic level. Bacteria incorporate this DOC into microbial biomass and are consumed by small zooplankton, returning the carbon to the food web. Taken from http://oceanworld.tamu.edu/resources/oceanography-book/microbialweb.htm.

If that were the end of the story, carbon would be raining down to the deep ocean at a rapid pace in the form of dead phytoplankton, dead consumers, and fecal pellets (this is actually the goal of ocean fertilization), and the photic zone, the sunlit portion of the ocean that supports photosynthesis, would be awash in small bits of organic carbon (we call this dissolved organic carbon, or DOC) that aren’t practical for larger organisms to eat.

What limits this in the ocean are bacteria, which consume DOC and partially degrade detrital particles.  Bacteria are large enough to be prey items for some consumers, so the carbon they incorporate is recycled into the food web in a process known as the microbial loop.  Bacteria take up a huge amount of the carbon that leaves the food web, but not all of it is looped back up to the higher trophic levels.  Like multicellular organisms, heterotrophic bacteria respire carbon dioxide.  If that carbon dioxide isn’t taken back up by phytoplankton (a further loop embedded in the food web) it will eventually leave the system.

Depth integrated bacterial production for the Palmer LTER during the 2014 LTER cruise, estimated from the uptake of tritiated leucine.

How much carbon is taken up by primary production, incorporated into bacterial biomass, and exported from the photic zone are huge research questions.  None of these questions are particularly easy to answer, and the numbers vary widely across time and space.  On the Palmer LTER cruise we used three different methods to estimate these values for the west Antarctic Peninsula.  I’ll talk a little about the first two here.  Bacterial production is measured by the uptake of tritiated leucine.  Leucine is an amino acid that is broadly taken up by the marine microbial community.  Tritium is a radioactive isotope of hydrogen; tritiated leucine is leucine with tritium in place of some of its hydrogen atoms.  It is possible to quantitatively measure very small amounts of radioactivity, which makes it possible to measure the incorporation of very small amounts of tritiated leucine into microbial biomass.

To do this we take water samples from multiple depths at each station and incubate them with tritiated leucine at the in-situ temperature.  After a few hours the bacteria in the water samples are killed ("fixed"), carefully washed, and measured for radioactivity.  From this a rate of leucine incorporation is calculated.  This number tells us something about the rate at which bacteria are increasing their biomass.  What we actually want to know, however, is how much carbon they are taking up to do this.  Many years of experiments have allowed the development of conversion factors for calculating the rate of carbon incorporation from the rate of leucine incorporation.  What this method doesn’t tell us, however, is the amount of carbon that is consumed during microbial respiration – a much harder number to get at!
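
As a rough illustration of the conversion step, a commonly cited theoretical factor is on the order of 1.5 kg of carbon produced per mole of leucine incorporated (assuming no isotope dilution); the factor actually applied varies between studies and should be taken from the literature.  A quick back-of-the-envelope calculation, with a made-up incorporation rate:

## back-of-the-envelope conversion from leucine incorporation to carbon;
## the conversion factor is an assumption and the rate below is made up

LEU_TO_C = 1.5e3  ## g C produced per mol leucine incorporated

def bp_from_leucine(leu_pmol_per_l_per_h):
    ## convert pmol leucine L-1 h-1 to micrograms C L-1 d-1
    mol_per_l_per_d = leu_pmol_per_l_per_h * 1e-12 * 24
    return mol_per_l_per_d * LEU_TO_C * 1e6

print(bp_from_leucine(20.0))  ## ~0.72 ug C L-1 d-1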

Depth integrated primary production for the Palmer LTER during the 2014 LTER cruise, estimated from the uptake of 14C bicarbonate.

The method for measuring primary production is similar to that for measuring bacterial production, except that radioactive bicarbonate (labeled with the 14C isotope of carbon) is used in place of radioactive leucine.  In seawater the bicarbonate is in equilibrium with dissolved carbon dioxide, which is incorporated into phytoplankton biomass during photosynthesis.

All of this fixed carbon exits the system in one of two ways.  Either it is respired as carbon dioxide as it works its way up the food web, eventually ending up in the atmosphere, or it is exported out of the photic zone as sinking detrital particles.  Particulate matter that is not consumed by bacteria below the photic zone will eventually end up on the seafloor.  The consumption of this carbon by the microbial community is often very high right at the seafloor, but rapidly diminishes below the sediment surface as oxidized chemical species are depleted.  Carbon that makes it to this depth is effectively sequestered, until a geological process (such as subduction) returns it to the surface.  Measuring the flux of particulate matter to the seafloor relies on the use of sediment traps or the measurement of radioactive thorium-234, and is far more complex than measuring the uptake of tritiated leucine or 14C bicarbonate.  To learn more about it check out the website of Café Thorium, Ken Buesseler’s research group at WHOI, particularly this article.

The two figures above show our estimates of primary and bacterial production across the study area.  Both sets of values are pretty high.  There was a strong phytoplankton bloom this year (perhaps connected to improved sea ice coverage) and the bacterial community seems to have responded in kind to this input of fixed carbon.  Note, however, that bacterial production is an order of magnitude below primary production.  Most of the carbon is exiting the system through respiration or particle export.  A small amount is contained within the biomass of phytoplankton, krill, and the higher trophic levels.  If you look carefully at the two plots you’ll also see that bacterial production is somewhat decoupled from primary production, being high in some places where primary production is low and low in some places where it is high.  The devil is in the details and it will take some digging to understand the dynamics of carbon transfer in this system!

The bacterial abundance, production, and export team for the 2014 Palmer LTER cruise, on the deck of the ARSV Laurence M. Gould. The odd looking contraption behind us is a French made sediment trap. Sediment traps are invaluable tools, but famously undersample particles sinking to the seafloor. The French design is supposed to minimize these effects. Fun fact – tenured WHOI faculty are impervious to the cold and to head injuries.

Posted in Research | Leave a comment

A cruise in the Palmer LTER

The Laurence M. Gould in one of the many coves off the West Antarctic Peninsula.

For the last five weeks I’ve been on a research cruise aboard the ARSV Laurence M. Gould off the west Antarctic Peninsula (WAP).  This region has received a lot of attention in recent years as one of the fastest warming places on the planet – annual temperatures have increased 7°C over the last 50 years.  Because of past sealing and whaling, and ongoing fishing, the WAP is hardly a pristine ecosystem (remember, say no to Chilean sea bass, aka the Patagonian toothfish, one of the most pressured predators in the region).  Nonetheless it’s under much less direct human pressure than most places on Earth, which, combined with the rapid rate of warming, makes it a good site to observe the impact of changing climate on marine ecosystems.  To carry out these observations the WAP hosts a long term ecological research site (the Palmer LTER), one of 25 NSF funded LTER sites.  The Palmer LTER is one of only two marine-pelagic LTERs, and one of only two LTERs located in Antarctica (the other is in the McMurdo Dry Valleys).  The cruise I participated in takes place each year during the Antarctic summer.  It is tasked with making a variety of measurements throughout the Palmer LTER and otherwise supporting science activities connected with long-term observations there.

Sample sites within the Palmer LTER. Each year the Gould occupies approximately three points spanning each “line”, depending on sea ice conditions.

The Laurence M. Gould has extremely limited access to the internet, so unfortunately it wasn’t possible to upload articles in real time.  Instead I’ll try to summarize the cruise in a couple of articles after the fact.

I mentioned that the climate of the WAP is changing fast.  Twenty years ago the whole peninsula would have been considered a polar environment, hosting a polar marine ecosystem.  The outlying islands that extend in a sweeping arc out to South Georgia Island would have been considered subpolar.  The subpolar ecosystem is now moving south in a hurry, changing the distribution of nutrients, primary production, and marine species.  Because it’s hard to picture what’s going on below the ocean’s surface it’s a little difficult to visualize these changes in the marine ecosystem.  A comparable change in the terrestrial environment might be the encroachment of desert on a region of rich grassland.  Deserts aren’t all bad (I rather like them), but they can’t support the biomass or diversity of grasslands.  In the parlance of economics they can’t support the same level of ecosystem services.

In the WAP a huge driver of ecological change is sea ice.  The species that occupy the Antarctic Peninsula can be divided into two groups: ice dependent and ice independent.  The ice dependent species have an absolute requirement for sea ice and include Adélie penguins, leopard seals, crabeater seals, sea ice algae, and most importantly, krill.  The krill relationship with sea ice isn’t totally straightforward.  Adult krill are happy to feed on phytoplankton in the water column and have no real sea ice requirement.  Juvenile krill, on the other hand, are dependent on the dense layer of lipid rich algae that grow on the underside of sea ice to support their growth.  No juvenile krill means no adult krill, a problem because nearly everything else in the Antarctic marine ecosystem depends on adult krill.  Without krill there is no way to transfer biomass from primary producers at the base of the food web to higher trophic levels.  To return to the grassland analogy, removing the krill from the ecosystem would be like removing all the hoofed mammals from the Serengeti.  All the predators and scavengers would disappear too.  Ultimately even the grass would fail, because the presence of large grazers has benefits like soil aeration and nutrient and seed dispersal.  In the WAP it would be worse than this, because the biomass far exceeds that of a grassland of comparable size, and this biomass fits critically into the ecology of such globe-trotting species as the humpback whale, the orca, and Wilson’s storm petrel (thought by some to be the world’s most abundant bird).

Sea ice cover in the Palmer LTER from 1981 to 2011.  Anvers Island is at the northern extent of the LTER, Adelaide Island near the center, and Charcot Island at the southern extent.  There is a lot of year to year variation, but the overall trend of reduced ice cover is clear.  From Ducklow et al. 2013, see link at end of post.

The general trend in the WAP is toward reduced sea ice and primary production in the north of the study region, and increased primary production in the south.  All the factors influencing this change aren’t yet clear, nor is it clear how these changes will manifest in the future.  It will take decades of observation to clarify the trend and the mechanisms behind it.  One likely driver of the southward shift in primary production is the reduced availability of micronutrients supplied by glacial meltwater.  The Peninsula, like the rest of the continent, is covered with glaciers.  Glaciers melt at their base where they contact rock, and this mineral-rich meltwater is a key source of iron and other micronutrients to the marine ecosystem.  It’s counterintuitive that warming would reduce the availability of meltwater.  The availability of these micronutrients has to do with the distribution of meltwater in the ocean, however, not the rate of melt.  Freshwater is much less dense than seawater, so glacial meltwater "floats" on top of seawater, forming a micronutrient rich lens.  In the presence of sea ice this lens is protected from wind driven vertical mixing, and for a brief period each summer following sea ice retreat there is a strong spatial overlap between micronutrient rich water and sunlight.  The result is a "bloom", a sudden explosion of growth among the primary producers.  In the absence of sea ice, storms dilute this surface water with micronutrient poor water from deeper in the ocean.  By summer the amount of photosynthesis that can be supported is much reduced.

Leopard seals in the WAP, near the British Antarctic Survey’s Rothera Station. Leopard seals are a key ice dependent predator in the region.

Nothing is that simple, however.  In the WAP the upwelling of micronutrient poor deep water through marine canyons appears to strongly support primary production.  Like upwelling water the world over, this deep water, while deficient in micronutrients, is rich in macronutrients like nitrate.  So the reality is probably that both water sources, and a reasonably stable water column, are required to support the level of primary production the rest of the ecosystem depends on.  Returning to the grassland analogy, tweaking the delivery and mixing rates of surface and deep water is equivalent to the jet stream shifting its course over land, depriving a historically wet region of water and limiting the amount of plant growth that it can support.

So, from the standpoint of primary production, two important things are happening in the WAP.  First, the reduction in sea ice means there are fewer ice algae to support the growth of juvenile krill.  Second, the reduction in sea ice is increasing the rate of vertical mixing in the water column, reducing the growth of phytoplankton – the food source of adult krill.

This year was an exception to the current trend, with ice conditions rebounding a little after years of decline.  It was nothing approaching the coverage that was normal 15 or 20 years ago, but it was sufficient to support an impressive phytoplankton bloom and will hopefully slow the loss of some of the ice dependent species, like the Adélie penguin, whose numbers have been dwindling.  In the next post I’ll talk a bit more about some of our surprising findings regarding the fate of all that carbon produced during the bloom…

For more on the Palmer LTER check out this excellent, non-technical article in the magazine of the Oceanography Society: West Antarctic Peninsula: an ice-dependent coastal marine ecosystem in transition

Posted in LMG14-01 | Leave a comment