New paper published in ISME Journal

I’m happy to report that a paper I wrote during my postdoc at the Lamont-Doherty Earth Observatory was published online today in the ISME Journal.  The paper, Bacterial community segmentation facilitates the prediction of ecosystem function along the coast of the western Antarctic Peninsula, uses a novel technique to “segment” the microbial community present in many different samples into a few groups (“modes”) that have specific functional, ecological, and genomic attributes.  The inspiration for this came when I stumbled across this blog entry on an approach used in marketing analytics.  Imagine that a retailer has a large pool of customers that it would like to pester with ads tailored to purchasing habits.  It’s too cumbersome to develop an individualized ad based on each customer’s habits, and it isn’t clear what combination of purchasing-habit parameters accurately describe meaningful customer groups.  Machine learning techniques, in this case emergent self-organizing maps (ESOMs), can be used to sort the customers in a way that optimizes their similarity and limits the risk of overtraining the model (including parameters that don’t improve the model).

In a 2D representation of an ESOM, the customers most like one another will be organized in geographically coherent regions of the map.  Hierarchical or k-means clustering can be superimposed on the map to clarify the boundaries between these regions, which in this case might represent customers that will respond similarly to a targeted ad.  But what’s really cool about this whole approach is that, unlike with NMDS or PCA or other multivariate techniques based on ordination, new customers can be efficiently classified into the existing groups.  There’s no need to rebuild the model unless a new type of customer comes along, and it is easy to identify when this occurs.

Back to microbial ecology.  Imagine that you have a lot of samples (in our case a five year time series), and that you’ve described community structure for these samples with 16S rRNA gene amplicon sequencing.  For each sample you have a table of OTUs, or in our case closest completed genomes and closest estimated genomes (CCGs and CEGs) determined with paprica.  You know that variations in community structure have a big impact on an ecosystem function (e.g. respiration, or nitrogen fixation), but how to test the correlation?  There are statistical methods in ecology that get at this, but they are often difficult to interpret.  What if community structure could be represented as a simple value suitable for regression models?

Enter microbial community segmentation.  Following the customer segmentation approach described above, the samples can be segmented into modes based on community structure with an emergent self-organizing map and k-means clustering.  Here’s what this looks like in practice:

From Bowman et al. 2016.  Segmentation of samples based on bacterial community structure.  C-I show the relative abundance of CEGs and CCGs in each map unit.  This value was determined iteratively while the map was trained, and reflects the values for samples located in each unit (B).

This segmentation reduces the data for each sample from many dimensions (the number of CCGs and CEGs present in each sample) to one.  This remaining dimension is a categorical variable with real ecological meaning that can be used in linear models.  For example, each mode has certain genomic characteristics:

From Bowman et al. 2016.  Genomic characteristics of modes (a and b), and metabolic pathways associated with taxa that account for most of the variations in composition between modes (d).

In panel a above we see that samples belonging to modes 5 and 7 (dominated by the CEG Rhodobacteraceae and CCG Dokdonia MED134, see Fig. 2 above) have the greatest average number of 16S rRNA gene copies.  Because this is a characteristic of fast growing, copiotrophic bacteria, we might also associate these modes with high levels of bacterial production.

Because the modes are categorical variables we can insert them right into linear models to predict ecosystem functions, such as bacterial production.  Combined with bacterial abundance and a measure of high vs. low nucleic acid bacteria, mode accounted for 76 % of the variance in bacterial production for our samples.  That’s a strong correlation for environmental data.  What this means in practice is that if you know the mode, and you have some flow cytometry data, you can make a pretty good estimate of carbon assimilation by the bacterial community.
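A minimal sketch of what this looks like in R, using simulated data (the variable names and values here are hypothetical, not those from the paper), just to show a mode entering a linear model as a factor:

## Simulated data: bacterial abundance, fraction of high nucleic acid (HNA) cells,
## and community mode for 50 hypothetical samples.
set.seed(1)
samples <- data.frame(abundance = rnorm(50, 1e9, 2e8),
                      hna = runif(50, 0.3, 0.7),
                      mode = factor(sample(paste0('mode_', 1:7), 50, replace = TRUE)))

## Fake a response (bacterial production) that depends on all three predictors.
samples$bp <- 1e-8 * samples$abundance * samples$hna + as.numeric(samples$mode) + rnorm(50)

## Mode enters the model like any other factor.
bp.lm <- lm(bp ~ abundance + hna + mode, data = samples)
summary(bp.lm)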

For more on what you can do with modes (such as testing for community succession) check out the article!  I’ll post a tutorial on how to segment microbial community structure data into modes using R in a separate post.  It’s easier than you think…

Posted in Research | Leave a comment

paprica v0.4.0

I’m happy to announce the release of paprica v0.4.0.  This release adds a number of new features to our pipeline for evaluating microbial community and metabolic structure.  These include:

  • NCBI taxonomy information for each point of placement on the reference tree, including internal nodes.
  • Inclusion of the domain Eukarya.  This was a bit tricky and requires some further explanation.

The distribution of metabolic pathways, predicted during the creation of the paprica Eukarya database, across transcriptomes in the MMETSP.

Eukaryotic genomes are a totally different beast than their archaeal and bacterial counterparts.  First and foremost they are massive.  Because of this there aren’t very many completed eukaryotic genomes out there, particularly for single-celled eukaryotes.  While a single investigator can now sequence, assemble, and annotate a bacterial or archaeal genome in very little time, eukaryotic genomes still require major efforts by consortia and lots of $$.

One way to get around this scale problem is to focus on eukaryotic transcriptomes instead of genomes.  Because much of the eukaryotic genome is noncoding this greatly reduces sequencing volume.  Since there is no such thing as a contiguous transcriptome, this approach also implies that no assembly (beyond open reading frames) will be attempted.  The Moore Foundation-funded Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) was an initial effort to use this approach to address the problem of unknown eukaryotic genetic diversity.  The MMETSP sequenced transcriptomes from several hundred different strains.  The taxonomic breadth of the strains sequenced is pretty good, even if (predictably) the taxonomic resolution is not.  Thus, as for archaea, the phylogenetic tree and metabolic inferences should be treated with caution.  For eukaryotes there are the additional caveats that 1) not all genes coded in a genome will be represented in the transcriptome 2) the database contains only strains from the marine environment and 3) eukaryotic 18S trees are kind of messy.  Considerable effort went into making a decent tree, but you’ve been warned.

Because the underlying data is in a different format, not all genome parameters are calculated for the eukaryotes.  18S gene copy number is not determined (and thus community and metabolic structure are not normalized), and the phi parameter, GC content, etc. are also not calculated.  However, eukaryotic community structure is evaluated and metabolic structure inferred in the same way as for the domains bacteria and archaea:

./paprica-run.sh test.eukarya eukarya

As always you can install paprica v0.4.0 by following the instructions here, or you can use the virtual box or Amazon Web Service machine instance.

Posted in paprica | Leave a comment

paprica on the cloud

This is a quick post to announce that paprica, our pipeline to evaluate community structure and conduct metabolic inference, is now available on the cloud as an Amazon Machine Instance (AMI).  The AMI comes with all dependencies required to execute the paprica-run.sh script pre-installed.  If you want to use it for paprica-build.sh you’ll have to install pathway-tools and a few additional dependencies.  I’m new to the Amazon EC2 environment, so please let me know if you have any issues using the AMI.

If you are new to Amazon Web Services (AWS) the basic way this works is:

  • Sign up for Amazon EC2 using your normal Amazon log-in
  • From the AWS console, make sure that your region is N. Virginia (community AMIs are only available in the region they were created in)
  • From your EC2 dashboard, scroll down to “Create Instance” and click “Launch Instance”
  • Now select the “Community AMIs”
  • Search for “paprica-ec2”, then select the AMI corresponding to the latest version of paprica (0.4.0 at the time of writing).
  • Choose the type of instance you would like to run the AMI on.  This is the real power of AWS; you can tailor the instance to the analysis you would like to run.  For testing choose the free t2.micro instance.  This is sufficient to execute the test files or run a small analysis (hundreds of reads).  To use paprica’s parallel features select an instance with the desired number of cores and sufficient memory.
  • Click “Review and Launch”, and finish setting up the instance as appropriate.
  • Log onto the instance, navigate to the paprica directory, execute the test file(s) as described in the paprica tutorial.  The AMI is not updated as often as paprica, so you may wish to reclone the github repository, or download the latest stable release.
Posted in paprica | 1 Comment

Antarctic Long Term Ecological Research


The Palmer and McMurdo Long Term Ecological Research (LTER) projects; separate but equal… at least in terms of interesting ecosystem dynamics if not in terms of biomass!

I’m very excited that our manuscript “Microbial community dynamics in two polar extremes: The lakes of the McMurdo Dry Valleys and the West Antarctic Peninsula Marine Ecosystem” has been published as an overview article in the journal BioScience.  The article belongs to a special issue comparing different ecological aspects of the two NSF-funded Long Term Ecological Research (LTER) sites in Antarctica.  I’m actually writing this post on my return trip from the first ever open science meeting of the International Long Term Ecological Research (ILTER) network at Kruger National Park in South Africa (an excellent place to ponder ecological questions).

This article had an odd genesis; the special issue was conceived by John Priscu, a PI with the McMurdo LTER project.  I was ensnared in the project along with Trista Vick-Majors, a graduate student with John Priscu (now a postdoctoral scholar at McGill University), shortly after starting my postdoc with Hugh Ducklow, PI on the Palmer LTER project.  The guidance we received was more or less “compare the McMurdo and Palmer LTERs”.  How exactly we should compare perennially ice-covered lakes in a polar desert to one of the richest marine ecosystems on the planet was left up to us.  Fortunately, microbial ecology lends itself to highly reductionist thinking.   This isn’t always helpful, but we reasoned that on a basal level the two ecosystems must function more or less the same.  Despite dramatically different physical settings, both environments host communities of phytoplankton (sometimes even similar taxonomic groups).  These convert solar energy into chemical energy and CO2 into organic carbon, thereby supporting communities of heterotrophic bacteria and grazers.

To look at the details of this we stretched the bounds of what constitutes an “overview article” and aggregated nearly two decades of primary production and bacterial production data collected by the McMurdo LTER, and over a decade of the same from the Palmer LTER.  By looking at the ratio of bacterial production to primary production we assessed how much carbon the heterotrophic bacterial community takes up relative to how much the phytoplankton community produces.


Figure from Bowman et al., 2016, BioScience.  A) Depth-integrated bacterial (BP) and primary production (PP) for the Palmer LTER study area and several lakes in Taylor Valley.  B)  The region occupied by the mean and standard deviation for discrete points (too many points to show).  C) The distribution of BP:PP for each site.

Typical marine values for this ratio are 1:10.  At a value of around 1:5 the carbon demands of heterotrophic bacteria are probably not met by phytoplankton production (the majority of carbon taken up by bacteria is lost through respiration and is not accounted for in the bacterial production assay).  Most of the lakes hover around 1:5, with values above this fairly common.  Lake Fryxell, however, an odd lake at the foot of Canada Glacier, has values that often exceed 1:1!  Consistent with previous work on the lakes, such high rates of bacterial production (relative to primary production) can only be met by a large external carbon subsidy.

Where does this external carbon come from?  Each summer the McMurdo Dry Valleys warm up enough that the various glaciers at the valley peripheries begin to melt.  This meltwater fuels chemoautotrophic bacterial communities where the glacier meets rock (the subglacial environment), and microbial mats in various streams and melt ponds.  Like microbial communities everywhere these bleed a certain amount of dissolved and particulate carbon (DOC and POC) into the surrounding water.  Some of this carbon ends up in the lakes where it enhances bacterial production.

But external carbon subsidies aren’t the only part of the story.  Nutrients, namely phosphate and nitrate, are washed into the lakes as well.  During big melt years (such as the summer of 2001-2002, when a major positive SAM coupled to an El Niño caused unusually high temperatures) the lakes receive big pulses of relatively labile carbon but also inorganic nutrients and silt.  This odd combination has the effect of suppressing primary production in the near term through lowered light levels (all that silt), enhancing it in the long term (all those nutrients), and giving heterotrophic bacteria some high quality external carbon to feed on during the period that primary production is suppressed.  Or at least that’s how we read it.

Not a lake person?  How do things work over in the Palmer LTER?  One of the biggest ecological differences between Palmer and McMurdo is that the former has grazers (e.g. copepods, salps, and krill) and the latter does not, or at least not so many to speak of.  Thus an argument can be made that carbon dynamics at Palmer are driven (at least partially) by top-down controls (i.e. grazers), while at McMurdo they are dependent almost exclusively on bottom-up (i.e. chemical and physical) controls.

At times the difference between bacterial production and primary production is pretty extreme at Palmer.  In the summer of 2006 for example, bacterial production was only 3 % of primary production (see Fig. 4 in the publication), and the rate of primary production that summer was pretty high.  The krill population was also pretty high that year; at the top of their 4-year abundance cycle (see Saba et al. 2014, Nature Communications).  This is speculative, but I posit that bacterial production was low in part because a large amount of carbon was being transferred via krill to the higher trophic levels and away from bacteria.  This is a complicated scenario because krill can be good for bacteria; sloppy feeding produces DOC and krill excrete large amounts of reduced nitrogen and DOC.  Krill also build biomass and respire, however, and their large fecal pellets sink quickly; these could be significant losses of carbon from the photic zone.

Antarctica is changing fast and in ways that are difficult to predict.  Sea ice seems to be growing in the east Antarctic as it is lost from the west Antarctic, and anomalous years buck this trend in both regions.  A major motivation for this special issue was to explore how the changing environment might drive ecological change.  I have to say that after spending a good portion of the (boreal) summer and fall thinking about this, some of that time from the vantage point of Palmer Station, I have no idea.  All of the McMurdo Lakes react differently to anomalous years, and Palmer as a region seems to react differently to each abnormal year.  I think the krill story is an important concept to keep in mind here; ecological responses are like superimposed waveforms.  Picture a regularly occurring phenomenon like the El Niño-Southern Oscillation imposing a periodicity on sea ice cover, which we know has a strong influence on biology.  Add a few more oscillating waves from other physical processes.  Now start to add biological oscillations like the four-year krill abundance cycle.  Can we deconvolute this mess to find a signal?  Can we forecast it forward?  Certainly not with 10 years of data at one site and 20 years at the other (and we’re so proud of these efforts!).  Check back next century… if NSF funds these sites that long…

Many thanks to my co-authors for going the distance on this paper, particularly the lake people for many stimulating arguments.  I think limnology and oceanography are, conceptually, much less similar than lakes and oceans.

Posted in Research | 1 Comment

Creating a landmask for the West Antarctic Peninsula in R


Silicate concentration in µM at 1m depth during the 2014 Palmer LTER cruise.  This plot is lying to you.  The interpolations extend past islets and into landmasses.

This is going to be a pretty niche topic, but probably useful for someone out there.  Lately I’ve been working with a lot of geospatial data for the West Antarctic Peninsula.  One of the things that I needed to do was krige the data (kriging is a form of 2D interpolation; I’m using the pracma library for this).  Kriging is a problem near coastlines because it assumes a contiguous space to work in.  If there happens to be an island or other landmass in the way there is no way to represent the resulting discontinuity in whatever parameter you’re looking at.  Because of this I needed to find a way to mask the output.  This doesn’t really solve the problem, but at least it allows me to identify areas of concern (for example interpolation that extends across an isthmus, if there are sample points only on one side).

I’m kriging and building maps entirely inside R, which has somewhat immature packages for dealing with geospatial data.  The easiest masking solution would be to use filled polygons from any polygon format shapefile that accurately represents the coastline.  Unfortunately I couldn’t find an R package that does this correctly with the shapefiles that I have access to.  In addition, because of my downstream analysis it was better to mask the data itself, and not just block out landmasses in the graphical output.

Sharon Stammerjohn at the Institute of Arctic and Alpine Research pointed me to the excellent Bathymetry and Global Relief dataset produced by NOAA.  This wasn’t a complete solution to the problem but it got me moving in the right direction.  From the custom grid extractor at http://maps.ngdc.noaa.gov/viewers/wcs-client/ I selected an ETOPO1 (bedrock) grid along the WAP, with xyz as the output format.  If you’re struggling with shapefiles the xyz format is like a cool drink of water, being a 3-column matrix of longitude, latitude, and height (or depth).  For the purpose of creating the mask I considered landmass as any lat-long combination with height > 0.

There is one more really, really big twist to what I was trying to do, however.  The Palmer LTER uses a custom 1 km pixel grid instead of latitude-longitude.  It’s a little easier to conceptualize than lat-long given the large longitude distortions at high latitude (and the inconvenient regional convergence of lat-long values on similar negative numbers).  It is also a little more ecologically relevant, being organized parallel to the coastline instead of north to south.  Unfortunately this makes the grid completely incompatible with other Euclidean reference systems such as UTM.  So before I could use my xyz file to construct a land mask I needed to convert it to the line-station grid system used by the Palmer LTER.  If you’re working in lat-long space you can skip over this part.


The Palmer LTER grid provides a convenient geospatial reference for the study area, but converting between line (y) and station (x) coordinates and latitude-longitude is non-trivial.

Many moons ago someone wrote a Matlab script to convert lat-long to line-station which you can find here.  Unfortunately I’m not a Matlab user, nor am I inclined to become one.  Fortunately it was pretty straightforward to copy-paste the code into R and fix the syntactic differences between the two languages.  Three functions in total are required:

## AUTHORS OF ORIGINAL MATLAB SCRIPT:
#   Richard A. Iannuzzi
#   Lamont-Doherty Earth Observatory
#   iannuzzi@ldeo.columbia.edu
#   based on: LTERGRID program written by Kirk Waters (NOAA Coastal Services Center), February 1997

## some functions that are used by the main function

SetStation <- function(e, n, CENTEREAST, CENTERNORTH, ANGLE){
  uu = e - CENTEREAST
  vv = n - CENTERNORTH
  z1 = cos(ANGLE)
  z2 = sin(ANGLE)
  NorthKm = (z1 * uu - z2 *vv) / 1000 + 600
  EastKm = (z2 * uu + z1 * vv) / 1000 + 40
  
  return(c(NorthKm, EastKm))
}

CentralMeridian <- function(iz){
  if(abs(iz) > 30){
    iutz = abs(iz) - 30
    cm = ((iutz * 6.0) -3.0) * -3600
  }
  else{
    iutz = 30 - abs(iz)
    cm = ((iutz * 6.0) +3.0) * +3600
  }
  return(cm)
}

GeoToUTM <- function(lat, lon, zone){

  axis = c(6378206.4,6378249.145,6377397.155,
          6378157.5,6378388.,6378135.,6377276.3452,
          6378145.,6378137.,6377563.396,6377304.063,
          6377341.89,6376896.0,6378155.0,6378160.,
          6378245.,6378270.,6378166.,6378150.)
  
  bxis = c(6356583.8,6356514.86955,6356078.96284,
          6356772.2,6356911.94613,6356750.519915,6356075.4133,
          6356759.769356,6356752.31414,6356256.91,6356103.039,
          6356036.143,6355834.8467,6356773.3205,6356774.719,
          6356863.0188,6356794.343479,6356784.283666,
          6356768.337303)
  
  ak0 = 0.9996
  radsec = 206264.8062470964  
  sphere = 9
  
  a = axis[sphere - 1]             # major axis size
  b = bxis[sphere - 1]             # minor axis size
  es = ((1-b^2/a^2)^(1/2))^2       # eccentricity squared
  slat = lat * 3600                # latitude in seconds
  slon = -lon * 3600               # longitude in seconds
  cm = 0                           # central meridian in sec
  iutz = 0
  
  cm = CentralMeridian(zone)       # call the function
  
  phi = slat/radsec
  dlam = -(slon - cm)/radsec
  epri = es/(1.0 - es)
  en = a/sqrt(1.0 - es * sin(phi)^2)
  t = tan(phi)^2
  c = epri * cos(phi)^2
  aa = dlam * cos(phi)
  s2 = sin(2.0 * phi)
  s4 = sin(4.0 * phi)
  s6 = sin(6.0 * phi)
  f1 = (1.0 - (es/4.)-(3.0*es*es/64.)-(5.0*es*es*es/256))
  f2 = ((3*es/8)+(3.0*es*es/32)+(45*es*es*es/1024))
  f3 = ((15*es*es/256)+(45*es*es*es/1024))
  f4 = (35*es*es*es/3072)
  em = a*(f1*phi - f2*s2 + f3*s4 - f4*s6)
  xx = ak0 * en * (aa + (1.-t+c) * aa^3/6 + (5 - 18*t + t*t + 72*c-58*epri)* aa^5/120) + 5e5
  yy = ak0 * (em + en * tan(phi) *((aa*aa/2) + (5-t+9*c+4*c*c)*aa^4/24 + (61-58*t +t*t +600*c - 330*epri)* aa^6/720))
  
  if(zone < 0 | slat < 0){
    yy = yy + 1e7
  }
  
  return(c(xx, yy))
}

## This function actually works with your data

ll2gridLter <- function(inlat, inlon){
  NorthKm = 0           # initialize
  EastKm = 0            # initialize
  zone = -20            # set zone (for LTER region, I think)
  ANGLE = -50 * pi / 180
  CENTEREAST = 433820.404        # eastings for station 600.040
  CENTERNORTH = 2798242.817      # northings for station 600.040
  
  # take latitude longitude and get station
  x.y = GeoToUTM(inlat, inlon, zone)
  NorthKm.EastKm = SetStation(x.y[1], x.y[2], CENTEREAST, CENTERNORTH, ANGLE)
  return(NorthKm.EastKm)
}

Once the functions are defined I used them to convert the lat/long coordinates in the xyz file to line-station.

## Read in xyz file.
lat.long.depth <- read.table('etopo1_bedrock.xyz', header = F, col.names = c('long', 'lat', 'depth'))

## Limit to points above sea level.
lat.long.land <- lat.long.depth[which(lat.long.depth$depth >= 0),]

## Create a matrix to hold the output.
line.station.land <- matrix(ncol = 3, nrow = length(lat.long.land$long))
colnames(line.station.land) <- c('line', 'station', 'depth')

## Execute the ll2gridLter function on each point.  Yes, I'm using a loop to do this.
for(i in 1:length(lat.long.land$long)){
  line.station.land[i,] <- c(ll2gridLter(lat.long.land$lat[i], lat.long.land$long[i]), lat.long.land$depth[i])
  print(paste(c(i, line.station.land[i,])))
}

## Write out the matrix.
write.csv(line.station.land, 'palmer_grid_landmask.csv', row.names = F, quote = F)

At this point I had a nice csv file with line, station, and elevation.  I was able to read this into my existing kriging script and convert it into a mask.

## Read in csv file.
landmask <- read.csv('palmer_grid_landmask.csv')

## Limit to the lines and stations that I'm interested in.
landmask <- landmask[which(landmask[,1] <= 825 & landmask[,1] >= -125),]
landmask <- landmask[which(landmask[,2] <= 285 & landmask[,2] >= -25),]

## Interpolated data is at 1 km resolution, need to round off
## to same resolution here.
landmask.expand <- cbind(ceiling(landmask[,1]), ceiling(landmask[,2]))

## Unfortunately this doesn't adequately mask the land.  Need to expand the size of each
## masked pixel 1 km in each direction.
landmask.expand <- rbind(landmask.expand, cbind(floor(landmask[,1]), floor(landmask[,2])))
landmask.expand <- rbind(landmask.expand, cbind(ceiling(landmask[,1]), floor(landmask[,2])))
landmask.expand <- rbind(landmask.expand, cbind(floor(landmask[,1]), ceiling(landmask[,2])))
landmask.expand <- unique(landmask.expand)

I’m not going to cover how I did the kriging in this post.  My kriged data is in a matrix called temp.pred.matrix with colnames given by ‘x’ followed by ‘station’, as in x20 for station 20, and row names ‘y’ followed by ‘line’, as in y100 for line 100.  To convert interpolated points that are actually land to NA values I simply added this line to my code:

temp.pred.matrix[cbind(paste0('y', landmask.expand[,1]), paste0('x', landmask.expand[,2] * -1))] <- NA
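If the indexing in that line looks unfamiliar: R accepts a two-column character matrix of row and column names as a matrix index.  A toy example (names hypothetical):

## Toy example of indexing a matrix by row/column name pairs.
m <- matrix(1:9, nrow = 3, dimnames = list(c('y1', 'y2', 'y3'), c('x1', 'x2', 'x3')))
m[cbind(c('y1', 'y3'), c('x2', 'x1'))] <- NA   # masks m['y1','x2'] and m['y3','x1']
m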

Here’s what the kriged silicate data looks like after masking.


Silicate concentration in µM at 1m depth during the 2014 Palmer LTER cruise after masking the underlying data.

Excellent.  The masked area corresponds with known landmasses; that’s Adelaide Island (home of Rothera Station) coming in at the eastern portion of Line 300, and various islets and the western edge of the Antarctic Peninsula to the northeast.  At this point erroneous data has been eliminated from the matrix.  Annual inventories of silicate and other nutrients can be more accurately calculated from the data and our eyes are not drawn to interesting features in the interpolation that have no chance of reflecting reality because they are over land.  The white doesn’t look that appealing to me in this plot, however, so I masked the land with black by adding points to the plot.  Again, I’m not going to show the whole plotting routine because some variables would require a lengthy explanation about how my larger dataset is structured.  The plot was created using imagep in the oce package.  One thing to be aware of is that this command automatically transposes the matrix, thus the mask needs to be transposed as well.  I did this manually in the following line:

## Show masked points as black squares.
points(landmask.expand[,1] ~ {-1 * landmask.expand[,2]}, pch = 15, cex = 0.6)

And the final plot:


Silicate concentration in µM at 1m depth during the 2014 Palmer LTER cruise after masking the underlying data and adding points to indicate the masked area.


Posted in Research | Leave a comment

Astrobiology Primer v2

The long-awaited version 2 of the Astrobiology Primer was published (open access) yesterday in the journal Astrobiology.  I’m not sure who first conceived of the Astrobiology Primer, first published in 2006, but the v2 effort was headed by co-lead editors Shawn Domagal-Goldman at NASA and Katherine Wright at the UK Space Agency.  The Primer v2 was a long time in coming; initial section text was submitted back in early 2011!  The longer these projects go on, the easier it is for them to die.  Many thanks to Shawn and Katherine for making sure that this didn’t happen!

The downside of course, is that the primer ran the risk of being outdated before it even went into review.  This was mitigated somewhat by the review process itself, and authors did have a chance to update their various sections.  Some sections are more stable than others; the section that I wrote with Shawn McGlynn (now at the Tokyo Institute of Technology) on how life uses energy for example, covers some fairly fundamental ground and is likely to stand the test of time.  Less so for sections that cover planetary processes in and outside of our solar system; paradigms are being broken in planetary science as fast as they form!

The Astrobiology Primer is a very valuable document because it takes a complicated and interdisciplinary field of study and attempts to summarize it for a broad audience.  Most of the Primer should be accessible to anyone with a basic grasp of science.  I wonder if it could even serve as a model for other disciplines.  What if the junior scientists in every discipline (perhaps roughly defined by individual NSF or NASA programs) got together once every five years to write an open-access summary of the major findings in their field?  This might provide a rich and colorful counterpoint to the valuable but often [dry, esoteric, top-down?  There’s an adjective that I’m searching for here but it escapes me] reports produced by the National Academies.

The co-lead editors were critical to the success of the Primer v2.  I haven’t explicitly asked Shawn and Katherine if they were compensated in any way for this activity – perhaps they rolled some of this work into various fellowships and such over the years.  More likely this was one more extracurricular activity carried out on the side.  Such is the way science works, and the lines are sufficiently blurred between curricular and extracurricular that most of us don’t even look for them anymore.  In recognition of this, and to speed the publication and heighten the quality of a future Primer v3, it would be nice to see NASA produce a specific funding call for a (small!) editorial team.  Three years of partial salary and funding for a series of writing workshops would make a huge difference in scope and quality.

Posted in Research | Leave a comment

How I learned to stop worrying and love subsampling (rarifying)

I have had the 2014 paper “Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible” by McMurdie and Holmes sitting on my desk for a while now.  Yesterday I finally got around to reading it and was immediately a little skeptical as a result of the hyperbole with which they criticized the common practice of subsampling* libraries of 16S rRNA gene reads during microbial community analysis.  The logic of that practice proceeds like this:

To develop 16S rRNA gene data (or any other marker gene data) describing a microbial community we generally collect an environmental sample, extract the DNA, amplify it, and sequence the amplified material.  The extract might contain the DNA from 1 billion microbial cells present in the environment.  Only a tiny fraction (<< 1 %) of these DNA molecules actually get sequenced; a “good” sequence run might contain only tens of thousands of sequences per sample.  After quality control and normalizing for multiple copies of the 16S rRNA gene it will contain far fewer.  Thus the final dataset contains a very small random sample of the original population of DNA molecules.

Most sequence-based microbial ecology studies involve some kind of comparison between samples or experimental treatments.  It makes no sense to compare the abundance of taxa between two datasets of different sizes, as the dissimilarity between the datasets will appear much greater than it actually is.  One solution is to normalize by dividing the abundance of each taxon by the total reads in the dataset to get its relative abundance.  In theory this works great, but it has the disadvantage that it does not take into account that a larger dataset has sampled the original population of DNA molecules more deeply.  Thus more rare taxa might be represented.  A common practice is to reduce the amount of information present in the larger dataset by subsampling to the size of the smaller dataset.  This attempts to approximate the case where both datasets undertake the same level of random sampling.
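For concreteness, here’s a minimal sketch of that subsampling step using the rrarefy function in the vegan package and a toy OTU table (with real data you would of course use your own table, with samples as rows):

library(vegan)

## Toy OTU table: three samples (rows) of very different sequencing depth.
set.seed(1)
otu.table <- rbind(s1 = rpois(50, 5), s2 = rpois(50, 20), s3 = rpois(50, 50))
rowSums(otu.table)

## Subsample each library down to the size of the smallest one.
otu.sub <- rrarefy(otu.table, min(rowSums(otu.table)))
rowSums(otu.sub)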

McMurdie and Holmes argue that this approach is indefensible for two common types of analysis: identifying differences in community structure between multiple samples or treatments, and identifying differences in abundance for specific taxa between samples and treatments.  I think the authors do make a reasonable argument for the latter analysis; however, at worst the use of subsampling and/or normalization simply reduces the sensitivity of the analysis.  I suspect that dissimilarity calculations between treatments or samples using realistic datasets are much less sensitive to reasonable subsampling than the authors suggest.  I confess that I am (clearly) very far from being a statistician and there is a lot in McMurdie and Holmes, 2014 that I’m still trying to digest.  I hope that our colleagues in statistics and applied mathematics continue optimizing these (and other) methods so that microbial ecology can improve as a quantitative science.  There’s no need, however, to get everyone spun up without a good reason.  To try and understand if there is a good reason, at least with respect to my data and analysis goals, I undertook some further exploration.  I would strongly welcome any suggestions, observations, and criticisms of this post!

My read of McMurdie and Holmes is that the authors object to subsampling because it disregards data that is present that could be used by more sophisticated (i.e. parametric) methods to estimate the true abundance of taxa.  This is true; data discarded from larger datasets does have information that can be used to estimate the true abundance of taxa among samples.  The question is how much of a difference does it really make?  McMurdie and Holmes advocate using methods adopted from transcriptome analysis.  These methods are necessary for transcriptomes because 1) the primary objective of the study is not usually to see if one transcriptome is different from another, but which genes are differentially expressed, and 2) I posit that the abundance distribution of transcript data is different than the taxa abundance distribution.  An example of this can be seen in the plots below.

Taxon abundance, taken from a sample that I’m currently working with.

Transcript abundance by PFAM, taken from a sample (selected at random) from the MMETSP.

In both cases the data is log distributed, with a few very abundant taxa or PFAMs and many rare ones.  What constitutes “abundant” in the transcript dataset, however, is very different from “abundant” in the community structure dataset.  The transcript dataset is roughly half the size (n = 9,000 vs. n = 21,000); nonetheless, the most abundant transcript has an abundance of 209.  The most abundant OTU has an abundance of 6,589, and there are several very abundant OTUs.  Intuitively this suggests to me that the structure of the taxon dataset is much more reproducible via subsampling than the structure of the transcript dataset, as the most abundant OTUs have a high probability of being sampled.  The longer tail of the transcript data contributes to this as well, though of course this tail is controlled to a large extent by the classification scheme used (here PFAMs).

To get an idea of how reproducible the underlying structure was for the community structure and transcript data I repeatedly subsampled both (with replacement) to 3000 and 1300 observations, respectively.  For the community structure data this is about the lower end of the sample size I would use in an actual analysis – amplicon libraries this small are probably technically biased and should be excluded (McMurdie and Holmes avoid this point at several spots in the paper).  The results of subsampling are shown in these heatmaps, where each row is a new subsample.  For brevity only columns summing to an abundance > 100 are shown.
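The subsampling itself is simple to reproduce; here’s a minimal sketch (using a toy abundance vector rather than my actual sample) of generating the iterations and the Bray-Curtis dissimilarities quoted below:

library(vegan)

## Toy abundance vector standing in for the OTU counts of a single sample.
set.seed(1)
counts <- rnbinom(200, mu = 20, size = 0.2)

## 100 subsamples of 3,000 observations each (with replacement), one per row.
boot <- t(rmultinom(100, size = 3000, prob = counts))

## Bray-Curtis dissimilarity between iterations.
boot.bc <- vegdist(boot, method = 'bray')
max(boot.bc)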

In these figures the warmer colors indicate a higher number of observations for that OTU/PFAM.  The transcript data is a lot less reproducible than the community structure data; the Bray-Curtis dissimilarity across all iterations maxes out at 0.11 for community structure and 0.56 for the transcripts.  The extreme case would be if the data were normally distributed (i.e. few abundant and few rare observations, many intermediate observations).  Here’s what subsampling does to normally distributed data (n = 21,000, mean = 1000, sd = 200):


If you have normally distributed data don’t subsample!

For the rest of us it seems that for the test dataset used here community structure is at least somewhat reproducible via subsampling.  There are differences between iterations, however; what does this mean in the context of the larger study?

The sample was drawn from a multiyear time series from a coastal marine site.  The next most similar sample (in terms of community composition and ecology) was, not surprisingly, a sample taken one week later.  By treating this sample in an identical fashion, then combining the two datasets, it was possible to evaluate how easy it is to tell the two samples apart after subsampling.  In this heatmap the row colors red and black indicate iterations belonging to the two samples:

Clustering of repeated subsamplings from two similar samples.  Sample identity is given by the red or black color along the y-axis.


As this heatmap shows, for these two samples there is perfect fidelity.  Presumably with very similar samples this would start to break down; determining how similar samples need to be before they cannot be distinguished at a given level of subsampling would be a useful exercise.  The authors attempt to do this in Simulation A/Figure 5 in the paper, but it isn’t clear to me why their results are so poor – particularly given very different sample types and a more sophisticated clustering method than I’ve applied here.
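That kind of exercise is easy to prototype; a sketch with two toy samples (not the samples used above) might look something like this:

## Two similar, but not identical, hypothetical samples.
set.seed(1)
counts.1 <- rnbinom(200, mu = 20, size = 0.2)
counts.2 <- counts.1 + rpois(200, 2)

## 100 subsamplings of each, combined into a single matrix.
boot.1 <- t(rmultinom(100, size = 3000, prob = counts.1))
boot.2 <- t(rmultinom(100, size = 3000, prob = counts.2))
boot.all <- rbind(boot.1, boot.2)

## Cluster the iterations; the row side colors indicate which sample each came from.
heatmap(boot.all,
        RowSideColors = rep(c('red', 'black'), each = 100),
        scale = 'none')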

As a solution – necessary for transcript data, normally distributed data, or for analyses of differential abundance, probably less essential for comparisons of community structure – the authors propose a mixture model approach that takes into account variance across replicates to estimate “real” OTU abundance.  Three R packages that can do this are mentioned: edgeR, DESeq, and metagenomeSeq.  The problem with these methods – as I understand them – is that they require experimental replicates.  According to the DESeq authors, technical replicates should be summed, and samples should be assigned to treatment pools (e.g. control, treatment 1, treatment 2…).  Variance is calculated within each pool and this is used to model differences between pools.  This is great for a factor-based analysis, as is common in transcriptome analysis or human microbial ecology studies.  If you want to find a rare, potentially disease-causing strain differentially present between a healthy control group and a symptomatic experimental group for example, this is a great way to go about it.

There are many environmental studies for which these techniques are not useful however, as it may be impractical to collect experimental replicates.  For example it is both undesirable and impractical to conduct triplicate or even duplicate sampling in studies focused on high-resolution spatial or temporal sampling.  Sequencing might be cheap now, but time and reagents are not.  Some of the methods I’m working with are designed to aggregate samples into higher-level groups – at which point these methods could be applied by treating within-group samples as “replicates” – but this is only useful if we are interested in testing differential abundance between groups (and doesn’t solve the problem of needing to get samples grouped in the first place).

These methods can be used to explore differential abundance in non-replicated samples, however, they are grossly underpowered when used without replication.  Here’s an analysis of differential abundance between the sample in the first heatmap above and its least similar companion from the same (ongoing) study using DESeq.  You can follow along with any standard abundance table where the rows are samples and the columns are variables.

library(DESeq)

## DESeq wants the data oriented opposite to how community abundance
## data is normally presented (e.g. to vegan)
data.dsq <- t(data)

## DESeq requires a set of conditions which are factors.  Ideally
## this would be control and treatment groups, or experimental pools
## or some such, but we don't have that here.  So the conditions are
## unique column names (which happen to be dates).
conditions <- factor(as.character(colnames(data.dsq)))

## As a result of 16S rRNA gene copy number normalization abundance
## data is floating point numbers, convert to integers.
data.dsq <- ceiling(data.dsq)

## Now start working with DESeq.
data.ct <- newCountDataSet(as.data.frame(data.dsq), conditions = conditions)
data.size <- estimateSizeFactors(data.ct)

## This method and sharing mode is required for unreplicated samples.
data.disp <- estimateDispersions(data.size, method = 'blind', sharingMode="fit-only")

## And now we can execute a test of differential abundance for the
## two samples used in the above example.
test <- nbinomTest(data.disp, '2014-01-01', '2009-12-14')
test <- na.omit(test)

## Plot the results.
plot(test$baseMeanA, test$baseMeanB,
     #log = 'xy',
     pch = 19,
     ylim = c(0, max(test$baseMeanB)),
     xlim = c(0, max(test$baseMeanA)),
     xlab = '2014-01-01',
     ylab = '2009-12-14')

abline(0, 1)

## If we had significant differences we could then plot them like
## this:
points(test$baseMeanA[which(test$pval < 0.05)],
       test$baseMeanB[which(test$pval < 0.05)],
       pch = 19,
       col = 'red')


As we would expect there are quite a few taxa present in high abundance in one sample and not the other; however, none of the associated p-values are anywhere near significant.  I’m tempted to try to use subsampling to create replicates, which would allow an estimate of variance across subsamples and access to greater statistical power.  This is clearly not as good as real biological replication, but we have to work within the constraints of our experiments, time, and funding…

*You might notice that I’ve deliberately avoided using the terms “microbiome” and “rarefying” here.  In one of his comics Randall Munroe asserted that the number of made up words in a book is inversely proportional to the quality of the book, similarly I strongly suspect that the validity of a sub-discipline is inversely proportional to the number of jargonistic words that community makes up to describe its work.  As a member of said community, what’s wrong with subsampling and microbial community??

Posted in Research | 2 Comments

Exploring genome content and genomic character with paprica and R

The paprica pipeline was designed to infer the genomic content and genomic characteristics of a set of 16S rRNA gene reads.  To enable this the paprica database organizes this information by phylogeny for many of the completed genomes in Genbank.  In addition to metabolic inference this provides an opportunity to explore how genome content and genomic characteristics are organized phylogenetically.  The following is a brief analysis of some genomic features using the paprica database and R.  If you aren’t familiar with the paprica database this exercise will also familiarize you with some of its content and its organization.

The paprica pipeline and database can be obtained from Github here.  In this post I’ll be using the database associated with version 0.3.1.  The necessary files from the bacteria database (one could also conduct this analysis on the much smaller archaeal database) can be read into R as such:

## Read in the pathways associated with the terminal nodes on the reference tree
path <- read.csv('paprica/ref_genome_database/bacteria/terminal_paths.csv', row.names = 1)
path[is.na(path)] <- 0

## Read in the data associated with all completed genomes in Genbank
data <- read.csv('paprica/ref_genome_database/bacteria/genome_data.final.csv', row.names = 1)

## During database creation genomes with duplicate 16S rRNA genes were removed,
## so limit to those that were retained
data <- data[row.names(data) %in% row.names(path),]

## "path" is ordered by clade, meaning it is in top to bottom order of the reference tree,
## however, "data" is not, so order it here
data <- data[order(data$clade),]

One fun thing to do at this point is to look at the distribution of metabolic pathways across the database.  To develop a sensible view it is best to cluster the pathways according to which genomes they are found in.

## The pathway file in the database is binary, so we use Jaccard for distance
library('vegan')
path.dist <- vegdist(t(path), method = 'jaccard') # distance between pathways (not samples!)
path.clust <- hclust(path.dist)

The heatmap function is a bit cumbersome for this large matrix, so the visualization can be made using the image function.

## Set a binary color scheme
image.col <- colorRampPalette(c('white', 'blue'))(2)

## Image will order matrix in ascending order, which is not what we want here!
image(t(data.matrix(path))[rev(path.clust$order),length(row.names(path)):1],
      col = image.col,
      ylab = 'Genome',
      xlab = 'Pathway',
      xaxt = 'n',
      yaxt = 'n')

box()

The distribution of metabolic pathways across all 3,036 genomes in the v0.3.1 paprica database.

There are a couple of interesting things to note in this plot.  First, we can see the expected distribution of core pathways present in nearly all genomes, and the interesting clusters of pathways that are unique to a specific lineage.  For clarity row names have been omitted from the above plot, but from within R you can pull out the taxa or pathways that interest you easily enough.  Second, there are some genomes that have very few pathways.  There are a couple of possible reasons for this that can be explored with a little more effort.  One possibility is that these are poorly annotated genomes, or at least the annotation didn’t associate many or any coding sequences with either EC numbers or GO terms – the two pieces of information Pathway-Tools uses to predict pathways during database construction.  Another possibility is that these genomes belong to obligate symbionts (either parasites or beneficial symbionts).  Obligate symbionts often have highly streamlined genomes and few complete pathways.  We can compare the number of pathways in each genome to other genome characteristics for additional clues.
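If you want to see which genomes those are, and assuming (as the plots below do) that path and data share the same row order, something like this works:

## The ten genomes with the fewest predicted pathways, with their genome sizes.
fewest <- order(rowSums(path))[1:10]
data.frame(organism = data$organism_name[fewest],
           n.paths = rowSums(path)[fewest],
           genome.size = data$genome_size[fewest])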

A reasonable assumption is that the number of pathways in each genome should scale with the size of the genome.  Large genomes with few predicted pathways might indicate places where the annotation isn’t compatible with the pathway prediction methods.

## Plot the number of pathways as a function of genome size
plot(rowSums(path) ~ data$genome_size,
     ylab = 'nPaths',
     xlab = 'Genome size')

## Plot P. ubique as a reference point
select <- grep('Pelagibacter ubique HTCC1062', data$organism_name)
points(rowSums(path)[select] ~ data$genome_size[select],
       pch = 19,
       col = 'red')

The number of metabolic pathways predicted as a function of genome size for the genomes in the paprica database.

That looks pretty good.  For the most part more metabolic pathways were predicted for larger genomes; however, there are some exceptions.  The red point gives the location of Pelagibacter ubique HTCC1062.  This marine bacterium is optimized for life under energy-limited conditions.  Among its adaptations is a highly efficient and streamlined genome.  In fact it has the smallest genome of any known free-living bacterium.  All the points to the left of it on the x-axis are obligate symbionts; these are dependent on their host for some of their metabolic needs.  There are a few larger genomes that have very few (or even no) pathways predicted.  These are the genomes with bad, or at least incompatible, annotations (or particularly peculiar biochemistry).
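Again assuming that path and data share row order, those suspect genomes are easy enough to flag (the thresholds here are arbitrary):

## Large genomes (> 2 Mbp) with fewer than 10 predicted pathways are candidates
## for annotation problems.
suspect <- which(data$genome_size > 2e6 & rowSums(path) < 10)
data$organism_name[suspect]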

The other genome parameters in paprica are the number of coding sequences identified (nCDS), the number of genetic elements (nge), the number of 16S rRNA gene copies (n16S), GC content (GC), and phi; a measure of genomic plasticity.  We can make another plot to show the distribution of these parameters with respect to phylogeny.

## Grab only the data columns we want
data.select <- data[,c('n16S', 'nge', 'ncds', 'genome_size', 'phi', 'GC')]

## Make the units somewhat comparable on the same scale, a more
## careful approach would log-normalize some of the units first
data.select.norm <- decostand(data.select, method = 'standardize')
data.select.norm <- decostand(data.select.norm, method = 'range')

## Plot with a heatmap
heat.col <- colorRampPalette(c('blue', 'white', 'red'))(100)
heatmap(data.matrix(data.select.norm),
      margins = c(10, 20),
      col = heat.col,
      Rowv = NA,
      Colv = NA,
      scale = 'none',
      labRow = 'n',
      cexCol = 0.8)

Genomic parameters organized by phylogeny.

Squinting at this plot it looks like GC content and phi are potentially negatively correlated, which could be quite interesting.  These two parameters can be plotted to get a better view:

plot(data.select$phi ~ data.select$GC,
     xlab = 'GC',
     ylab = 'phi')

The phi parameter of genomic plasticity as a function of GC content.

Okay, not so much… but I think the pattern here is pretty interesting.  Above a GC content of 50 % there appears to be no relationship, but these parameters do seem correlated for low GC genomes.  This can be evaluated with linear models for genomes above and below 50 % GC.

gc.phi.above50 <- lm(data.select$phi[which(data.select$GC >= 50)] ~ data.select$GC[which(data.select$GC >= 50)])
gc.phi.below50 <- lm(data.select$phi[which(data.select$GC < 50)] ~ data.select$GC[which(data.select$GC < 50)])

summary(gc.phi.above50)
summary(gc.phi.below50)

plot(data.select$phi ~ data.select$GC,
     xlab = 'GC',
     ylab = 'phi',
     type = 'n')

points(data.select$phi[which(data.select$GC >= 50)] ~ data.select$GC[which(data.select$GC >= 50)],
       col = 'blue')

points(data.select$phi[which(data.select$GC < 50)] ~ data.select$GC[which(data.select$GC < 50)],
       col = 'red')

abline(gc.phi.above50,
       col = 'blue')

abline(gc.phi.below50,
       col = 'red')

legend('bottomleft',
       bg = 'white',
       legend = c('GC >= 50',
                  'GC < 50'),
       col = c('blue', 'red'),
       pch = 1)

Genomic plasticity (phi) as a function of GC content for all bacterial genomes in the paprica database.

As expected there is no correlation between genomic plasticity and GC content for the high GC genomes (R2 = 0) and a highly significant correlation for the low GC genomes (albeit with weak predictive power; R2 = 0.106, p = 0).  So what’s going on here?  Low GC content is associated with particular microbial lineages but also with certain ecologies.  The free-living low-energy specialist P. ubique HTCC1062 has a low GC content genome for example, as do many obligate symbionts regardless of their taxonomy (I don’t recall if it is known why this is).  Both groups are associated with a high degree of genomic modification, including genome streamlining and horizontal gene transfer.

Posted in paprica | Leave a comment

The 6 cent speeding ticket

I’m going to go way off the normal track here and do a bit of social commentary.  I heard a radio piece on my drive home yesterday about the challenge of paying court and legal fees for low income wage earners.  This can trap those guilty of minor offenses (like a traffic infraction) in a cycle of jail and debt that is difficult to break out of.  It never made sense to me that financial penalties – which by their nature are punitive – don’t scale with income, as is common in some European countries.  I decided to try and visualize how the weight of a penalty for say, a speeding ticket, scales with income.  It was tempting to try and scale up, i.e. what is the equivalent of $300 for someone earning $X?  I decided however, that it would be more informative to try and scale down.

Cost equivalent in minimum wage earner dollars of a $300 penalty for individuals at different income levels. Red text is the cost equivalent (y-axis value). X-axis (income) is on a log scale. See text for income data sources.

In this scenario the penalty is $300.  Someone earning the federal minimum wage and working 40 hours a week makes $15,080/year, so the $300 penalty is roughly 2 % of annual income.  So be it, perhaps that’s a fair penalty.  But what would be the equivalent if the same offender earned more?  A private/seaman/airman who has just joined the military earns roughly $18,561/year.  Paying the same ticket (and I know from experience that the military pays a lot of them) would equate to the minimum wage earner paying $243.72.  A graduate student fortunate enough to get a stipend (and own a car) might earn $25,000/year.  Paying the same ticket would be equivalent to the lowest wage earner paying $180.96, and down it goes along the income scale.  If LeBron James, who earned $77.2 million last year in salary and endorsements (according to Forbes), got the ticket, the penalty would be equivalent to the lowest income wage earner paying $0.06.  Salary data came from a variety of sources, including here and here.  Salaries marked with an asterisk in the plot above are medians from these sources.
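The scaling used for the plot is just a proportion of income; a quick sketch in R reproduces the numbers above to within rounding:

## A $300 penalty expressed as the equivalent burden, in dollars, for a minimum wage earner.
min.wage.income <- 7.25 * 40 * 52    # $15,080/year at the federal minimum wage
penalty <- 300

equivalent.penalty <- function(income){penalty / income * min.wage.income}

equivalent.penalty(c(18561, 25000, 77.2e6))    # military recruit, grad student, LeBron James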

Posted in Uncategorized | 1 Comment

The Database Dilemma

Microbial ecologists know they have a problem with data archiving, particularly when it comes to sequence data.  I’m not entirely sure why this is the case; in theory it would be pretty straightforward to build a database searchable by the parameters ecologists are interested in.  I suspect that part of the problem is a lack of interest on the part of NSF (the primary source of funding for ecology) in designing and maintaining databases, and the perception that this is duplicative of the (rather sad) efforts of the NIH-funded NCBI in this area.  I don’t want to be too disparaging of NCBI – they have a much harder job than the simple database I’ve outlined above – but the Sequence Read Archive (SRA) and other NCBI repositories are a near-total disaster.  We’re wasting a lot of collective time, money, and effort developing sequence data that can’t be used for future studies because it can’t be easily accessed.

In response to this a small number of alternate repositories have popped up.  In my experience however, they don’t offer much of an improvement.  This isn’t meant as a harsh criticism of various peoples’ good efforts, I just don’t think the community’s found a viable model yet.  One example is the VAMPS database operated by the Marine Biological Laboratory, which in addition to having an unfortunate name (I believe it was named before the onset of the angsty-teen vampire thing, try googling “VAMPS” and you’ll see what I mean), tries to do too much.  Rather than organizing data and making it searchable the designers opted to try and build an online analysis pipeline.  Instead of easy to find data you end up with an analytical black box and canned results.

The reason for all this soap-boxing is my experience this week trying to obtain some global marine 16S rRNA gene amplicon data for a project that I’m working on.  The data I had hoped to obtain was used in the 2014 PNAS paper Marine bacteria exhibit a bipolar distribution.  My difficulty in obtaining the data highlights failures at the author, reviewer, journal, and repository levels.  I have great respect for many of the authors on this paper, so again this isn’t meant as a harsh criticism (I’m sure I haven’t made data I’ve published sufficiently accessible either), but rather a lesson in something that we all need to do better at.  I’ll end with the Python-based solution I came up with for getting useful sequence data out of the VAMPS flat-file format.

The 2014 paper used 277 samples that were collected as part of the International Census of Marine Microbes (ICoMM).  There are quite a few things one can say about that particular endeavor, but we’ll save that for another post.  The authors included the accession numbers for all of these studies in a supplementary file.  This file is a pdf, so not the easiest thing to parse, but at least the text is renderable, so the (7-page) table can be copied into a text file and parsed further.  With a little bit of effort you can get a single-column text file containing the accession numbers for each of these studies (a minimal sketch of this parsing step follows the examples below).  Great.  Unfortunately a good number of these accession numbers are wildly incorrect, and most of the remainder point to SRA objects that can’t be recognized by the woefully under-documented SRA toolkit.  Here are some examples:

SRX011052 – Enter this accession number into the SRA search bar and it will navigate you to a page for a sample with the helpful name “Generic sample from marine metagenome”.  Never mind for a moment that “metagenome” is an inappropriate name for an amplicon dataset; where is the data?  Click the link for “Run” (which has a different accession number: SRR027219) and navigate to a new page with lots of obscure information you don’t care about.  If you’re actually doing this you might, in near desperation, click the “Download” tab in the hope that this gets you somewhere.  Psych!  All that page has is a message that you need to use the SRA toolkit to download the data.  Let’s try the prefetch utility in the toolkit:

>prefetch SRR027219

2016-05-25T15:30:43 prefetch.2.4.2 err: path not found while resolving tree
within virtual file system module - 'SRR027219.sra' cannot be found.

Whatever that means.  There’s a trick to using this utility, which I’ll post as soon as I remember what it is.  The documentation suggests that the above command should work.  The absurdity of needing to rely on this utility in place of FTP usually drives me to use the EMBL-EBI site instead.  This is the European equivalent of GenBank; anything submitted to one should be accessible through the other.  In general EMBL-EBI is much better organized and more user-friendly than the SRA.

Entering “SRX011052” into the EMBL-EBI search bar returns some results that also point us to SRR027219.  The page for SRR027219 has such helpful things as an FTP link for direct download, as well as some basic info on the sample.  That’s great, but because the authors opted to provide “Experiment” accession numbers for some samples (e.g. SRX011052), there is no way that I can see to generate a script to automate the downloads, as we have no way of knowing the run accession numbers.  The solution is to tediously follow the links to all 277 runs.  I actually started doing this, which led me to my second problem.

SRP001129 – At some point in the supplementary table the accession numbers switch from an “SRX” prefix to “SRP”.  Picking one of these at random and entering it into the EMBL-EBI search bar returns a result for a genome sequencing project related to the Human Microbiome Project.  Clearly this isn’t from the ICoMM project.  Roughly a third of the accession numbers reported in the paper aren’t correct!
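
As promised above, here’s the parsing step for the supplementary table: a minimal sketch, assuming the pdf table has been copied into a plain text file (both file names here are hypothetical), that simply pulls out anything shaped like an SRA-style accession number.

import re

## pull SRA-style accession numbers (e.g. SRX011052, SRP001129) out of
## text copied from the supplementary pdf table

accessions = []

with open('icomm_supplementary_table.txt') as table:
    for line in table:
        accessions.extend(re.findall(r'[SED]R[APRSXZ]\d+', line))

## write one accession per line for downstream scripts

with open('icomm_accessions.txt', 'w') as output:
    for accession in accessions:
        print(accession, file=output)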

Clearly I wasn’t going to get the data that I needed through either EMBL-EBI or the SRA.  Knowing that the study had been part of the ICoMM, I turned to the MBL VAMPS database, where the ICoMM data is also archived.  Thinking that I was on the home stretch, I followed the link to the ICoMM portal from the VAMPS homepage.  The right side column gave me two options for download: tar-zipped sff files or a tab-delimited file.  Not wanting to deal with converting from sff (the native 454 file format), I took a look at the tab-delimited option.  The top lines of the file look like this:

project dataset sequence        taxonomy        knt     rank    total_knt       frequency
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGACGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGA  Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  13      family  36172   0.000359394006
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCTAGAGAG      Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGATGAAGATTCAGCGTAATGAGCTTATCGGATTTTCAGAGAG      Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAGGGCCGACTGTTAGATGAAGACCAGTGTAACGAACTTGTCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAGGGCCGACTGTTATATGAAGACCAATGTGATGAACTTGTCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  2       family  36172   5.5291386e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAGGGCCGACTGTTATATGAAGACCAGCGTAATGAGCTTGTCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAGGGCCGACTGTTATATGAAGACCAGCGTGATGAACTTGTCGGATTTTCAGAGAG   Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05

Okay, so this is a bizarre way to share sequence data, but at this point I was willing to take it.  Basically the file gives us the unique reads, their abundance (knt), and their taxonomy according to VAMPS.  What I needed to do was convert this into a separate, non-unique fasta file for each sample.  With a little help from Pandas this wasn’t too bad.  The vamps_icomm_surface.csv file specified below is just a csv version of the pdf table in the PNAS publication, after some tidying up.  The second column of the csv file (i.e. index 1 in Python land) is the sample name in VAMPS (I had to split the first column of the original table on ‘.’ to get this).

import pandas as pd

## get the metadata for surface samples, this was generated
## from the pdf table in the PNAS paper

metadata = pd.read_csv('vamps_icomm_surface.csv', index_col=1)

## get the seqs, downloaded from https://vamps.mbl.edu/portals/icomm/data_exports/icomm_data.tar.gz

seqs = pd.read_csv('icomm_data.tsv', delimiter='\t', index_col=1)

## iterate across each study in metadata, making a temp dataframe
## of reads from that study and keeping only the bacterial reads

for study in metadata.index:
    temp = seqs.loc[study]
    temp_bacteria = temp[temp['taxonomy'].str.contains('Bacteria')]

    ## now iterate across each row of bacterial reads, exporting each
    ## unique sequence knt times to a fasta file, with a running
    ## counter to keep the read names unique

    i = 0

    with open(study + '.bacteria.fasta', 'w') as output:
        for index, row in temp_bacteria.iterrows():
            sequence = row['sequence']
            for n in range(row['knt']):
                i += 1
                name = index + '.' + str(i)
                print(name)
                print('>' + name, file=output)
                print(sequence, file=output)

To end, I’d like to circle back to the original dual problems of poor database design and erroneous accession numbers.  Although these problems were not directly connected in this case, they exacerbated each other, magnifying the overall problem of data (non)reuse.  Some suggestions:

  1. The SRA and other databases need to aggregate the run accession numbers for each new study.  A couple of clicks should lead me to a download link for every sequence associated with a study, whether that study generated the raw data or only used it.  I don’t believe it is currently possible to do this within the SRA; initiating a new project assumes that you will contribute new raw sequences (correct me if you think otherwise).  This should be easy to implement, and the association of a publication with a project should be rigorously enforced by journals, departments, and funding agencies.
  2. Accession numbers should be automatically checked by journals the way references are.  If an accession number doesn’t point to something that is downloadable (or at least to a page with a download link), the author should have to correct it.  There is no excuse for incorrect, incomplete, or non-run accession numbers.  That this was allowed to happen in PNAS – a top-tier journal – is doubly frustrating.  And frankly, if you reviewed this paper you owe me beer for not spot-checking the refs.  A rough sketch of what such an automated check might look like follows this list.
  3. Validating this information is the job of the reviewers.  Almost nobody does this.  Usually when I review a paper containing sequence data I’m the only reviewer who insists that the data be available at the time of publication (I can’t think of a single case where I wasn’t the only reviewer to insist).  Not only is this a necessary check, but part of the job of the reviewer is to spot-check the analysis itself.  Clearly, without the data, that’s not being done either…
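
To make point 2 a little more concrete, here’s a minimal sketch of the kind of automated check I have in mind.  The accession list file is hypothetical (e.g. the single-column file parsed from the supplementary table above), and a real check would also need to confirm that each accession actually resolves to downloadable data, not just that it has the right format.

import re

## expected formats for the main SRA accession types: runs (SRR/ERR/DRR),
## experiments (SRX/ERX/DRX), and projects (SRP/ERP/DRP)

patterns = {'run': r'^[SED]RR\d{6,}$',
            'experiment': r'^[SED]RX\d{6,}$',
            'project': r'^[SED]RP\d{6,}$'}

with open('icomm_accessions.txt') as accessions:
    for line in accessions:
        accession = line.strip()
        matched = [name for name, pattern in patterns.items()
                   if re.match(pattern, accession)]
        if not matched:
            print(accession + ': does not look like an SRA accession')
        elif 'run' not in matched:
            print(accession + ': ' + matched[0] + ' accession, not a run accession')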

And that’s enough ranting about data availability for one day, back to work…
