The Database Dilemma

Microbial ecologists know they have a problem with data archiving, particularly when it comes to sequence data.  I’m not entirely sure why this is the case; in theory it would be pretty straightforward to build a database searchable by the parameters ecologists are interested in.  I suspect that part of the problem is a lack of interest on the part of NSF (the primary source of funding for ecology) in designing and maintaining databases, and the perception that this is duplicative of the (rather sad) efforts of the NIH-funded NCBI in this area.  I don’t want to be too disparaging of NCBI – they have a much harder job than the simple database I’ve outlined above – but the Sequence Read Archive (SRA) and other NCBI repositories are a near-total disaster.  We’re wasting a lot of collective time, money, and effort developing sequence data that can’t be used for future studies because it can’t be easily accessed.

In response to this a small number of alternate repositories have popped up.  In my experience however, they don’t offer much of an improvement.  This isn’t meant as a harsh criticism of various peoples’ good efforts, I just don’t think the community’s found a viable model yet.  One example is the VAMPS database operated by the Marine Biological Laboratory, which in addition to having an unfortunate name (I believe it was named before the onset of the angsty-teen vampire thing, try googling “VAMPS” and you’ll see what I mean), tries to do too much.  Rather than organizing data and making it searchable the designers opted to try and build an online analysis pipeline.  Instead of easy to find data you end up with an analytical black box and canned results.

The reason for all this soap-boxing is my experience this week trying to obtain some global marine 16S rRNA gene amplicon data for a project that I’m working on.  The data I had hoped to obtain was used in the 2014 PNAS paper Marine bacteria exhibit a bipolar distribution.  My difficulty in obtaining the data highlights failures at the author, reviewer, journal, and repository levels.  I have great respect for many of the authors on this paper, so again this isn’t meant as a harsh criticism (I’m sure I haven’t made data I’ve published sufficiently accessible either), but rather a lesson in something that we all need to do better at.  I’ll end with the Python-based solution I came up with for getting useful sequence data out of the VAMPS flat-file format.

The 2014 paper used 277 samples that were collected as part of the International Census of Marine Microbes (ICoMM).  There are quite a few things one can say about that particular endeavor but we’ll save it for another post.  The authors included the accession numbers for all these studies in a supplementary file.  This file is a pdf, so not the easiest thing to parse, but at least the text is renderable, so the (7-page) table can be copied into a text file and parsed further.  With a little bit of effort you can get a single-column text file containing the accession numbers for each of these studies.  Great.  Unfortunately a good number of these are wildly incorrect, and most of the remainder point to SRA objects that can’t be recognized by the woefully inadequately documented SRA toolkit.  Here are some examples:

SRX011052 – Enter this accession number into the SRA search bar and it will navigate you to a page for a sample with the helpful name “Generic sample from marine metagenome”.  Nevermind for a moment that “metagenome” is an inappropriate name for an amplicon dataset, where is the data?  Click the link for “Run” (with a different accession number: SRR027219) and navigate to a new page with lots of obscure information you don’t care about.  If you’re actually doing this you might, in near desperation, click the “Download” tab in the hope that this gets you somewhere.  Psych!  All that page has is a message that you need to use the SRA toolkit to download this data.  Let’s try the prefetch utility in the toolkit:

>prefetch SRR027219

2016-05-25T15:30:43 prefetch.2.4.2 err: path not found while resolving tree
within virtual file system module - 'SRR027219.sra' cannot be found.

Whatever that means.  There’s a trick to using this utility, which I’ll post as soon as I remember what it is.  The documentation suggests that the above command should work.  The absurdity of needing to rely on this utility in place of FTP usually drives me to use the EMBL-EBI site instead.  This is the Euro equivalent of Genbank, anything submitted to one should be accessible through the other.  In general EMBL-EBI is much better organized and more user friendly than SRA.

Entering “SRX011052” into the ENBL-EBI search bar returns some results that also point us to SRR027219.  The page for SRR027219 has such helpful things as an FTP link for direct download as well as some basic info on the sample.  That’s great, but because the authors opted to provide “Experiment” accession numbers for some samples (e.g. SRX011052) there is no way that I can see to generate a script to automate downloads, as we have no way of knowing the run accession numbers.  The solution is to tediously follow the links to all 277 runs.  I actually started doing this which lead me to my second problem.

SRP001129 – At some point in the supplementary table the accession numbers switch from a “SRX” prefix to “SRP”.  Picking one of these at random and entering it into the EMBL-EBI search bar returns a result for a genome sequencing project related to the Human Microbiome Project.  Clearly this isn’t from the ICoMM project.  Roughly a third of the accession numbers reported in the paper aren’t correct!

Clearly I wasn’t going to get the data that I needed through either EMBL-EBI or the SRA.  Knowing that the study had been part of the ICoMM I turned to the MBL VAMPS database where the ICoMM data is also archived.  Thinking that I was on the home stretch I followed the link to the ICoMM portal from the VAMPS homepage.  The right side column gave me two options for download; tar-zipped sff files or a tab-delimited file.  Not wanting to deal with converting from sff (the organic 454 file format) I took a look at the tab-delim option.  The top lines of the file look like this:

project dataset sequence        taxonomy        knt     rank    total_knt       frequency
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGACGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGA  Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  13      family  36172   0.000359394006
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCTAGAGAG      Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGATGAAGATTCAGCGTAATGAGCTTATCGGATTTTCAGAGAG      Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAGGGCCGACTGTTAGATGAAGACCAGTGTAACGAACTTGTCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAGGGCCGACTGTTATATGAAGACCAATGTGATGAACTTGTCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  2       family  36172   5.5291386e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAGGGCCGACTGTTATATGAAGACCAGCGTAATGAGCTTGTCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAGGGCCGACTGTTATATGAAGACCAGCGTGATGAACTTGTCGGATTTTCAGAGAG   Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05

Okay, so this is a bizarre way to share sequence data but at this point I was willing to take it.  Basically the file gives us unique reads, their abundance (knt), and taxonomy according to VAMPS.  Want I needed to do was convert this into a separate, non-unique fasta file for each sample.  With a little help from Pandas this wasn’t too bad.  The vamps_icomm_surface.csv file specified below is just a csv version of the pdf table in the PNAS publication, after some tidying up.  The second row of the csv file (i.e. index 1 in Python land) is the sample name in VAMPS (I had to split the first column of the original table on ‘.’ to get this).

import pandas as pd

## get the metadata for surface samples, this was generated
## from the pdf table in the PNAS paper

metadata = pd.read_csv('vamps_icomm_surface.csv', index_col = 1)

## get the seqs, downloaded from https://vamps.mbl.edu/portals/icomm/data_exports/icomm_data.tar.gz
seqs = pd.read_csv('icomm_data.tsv', delimiter = '\t', index_col = 1)

## iterate across each study in metadata, making a temp dataframe
## of reads from that study

for study in metadata.index:
    temp = seqs.loc[study]
    temp_bacteria = temp[temp['taxonomy'].str.contains('Bacteria')]
    
    ## now iterate across each row in the temp dataframe, exporting
    ## the sequence the specified number of times to a fasta file
    
    with open(study + '.bacteria.fasta', 'w') as output:
        for row in temp.iterrows():
            sequence = row[1]['sequence']
            for i in range(1, row[1]['knt']):
                name = row[0] + '.' + str(i)
                print name
                print >> output, '>' + name
                print >> output, sequence

To end I’d like to circle back to the original dual problems of poor database design and erroneous accession numbers.  Although these problems were not directly connected in this case they exacerbated each other, magnifying the overall problem of data (non)reuse.  Some suggestions:

  1.  The SRA and other databases needs to aggregate the run accession numbers for each new study.  A couple of clicks should lead me to a download link for every sequence associated with a study, whether that study generated the raw data or only used it.  I don’t believe it is currently possible to do this within SRA; initiating a new project assumes that you will contribute new raw sequences (correct me if you think otherwise).  This should be easy to implement, and the association of a publication with a project should be rigorously enforced by journals, departments, and funding agencies.
  2. Accession numbers should be automatically checked by journals the way references are.  If it points to something that isn’t downloadable (or at least has a download link on the page) the author should correct it.  There is no excuse for incorrect, incomplete, or non-run accession numbers.  That this was allowed to happen in PNAS – a top-tier journal – is double frustrating.  And frankly, if you reviewed this paper you owe me beer for not spot checking the refs.
  3. Validating this information is the job of the reviewers.  Almost nobody does this.  Usually when I review a paper containing sequence data I’m the only reviewer that insists the data is available at the time of publication (I can’t think of a single case where I wasn’t the only reviewer to insist).  Not only is this a necessary check, but part of the job of the reviewer is to spot check the analysis itself.  Clearly, without data that’s not being done either…

And that’s enough ranting about data availability for one day, back to work…

6214 Total Views 2 Views Today
This entry was posted in Research. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

WordPress Anti Spam by WP-SpamShield