For an ongoing project I needed to predict the metabolic pathways that are present in a metagenome. This is actually something that I’ve been interested in trying to do for a while, as metabolic pathways can tell us much more about metabolic function than genes alone. There may be some pre-packaged methods out there for doing this, but I’ve struggled over the years to find one that doesn’t require commercial software or a subscription to a commercial database (i.e. KEGG).
I’ve worked out the following method that relies on DIAMOND, Biopython, and Pathway-Tools. All are available for free for academic users. I went down this path at the request of a reviewer for a submitted manuscript, if you use the method please cite the (hopefully) forthcoming work – Microbial communities can be described by metabolic structure: A general framework and application to a seasonally variable, depth-stratified microbial community from the coastal West Antarctic Peninsula. I haven’t archived the work on a pre-press server (I’m still not sure how I feel about those), but I’m happy to provide a copy of the manuscript and/or the citation. I’ll try to remember to add the citation here when (if!) the revised manuscript is accepted (Accepted! see link in this article).
The general strategy here is to:
1. Shake your fist at the hard, cruel world that forced KEGG to go commercial.
2. Create a DIAMOND database of all the coding sequences present in all completed Genbank genomes.
3. Use DIAMOND blastx to search all the metagenomic reads against this database.
4. Create an artificial Genbank format sequence record containing all the features that had a hit. To avoid time and memory constraints this record will contain only unique BRENDA EC identifiers and gene products.
5. Use pathway tools to predict the metabolic pathways in this artificial “genome”.
A word of warning… I hacked this together in a non-linear fashion, thus the code presented here is NOT exactly what I ran. There might be some errors, so if you try this and it doesn’t run just take a deep breath and track down the errors. Feel free to ping me if you can’t find the problem. Okay, here we go!
Download the complete genomes from Genbank and go get a fresh cup of coffee:
wget -r -A *.gbk -nc ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/
Next, extract all the feature translations from these Genbank files (yes, you could just download the faa files, but I had reasons for needing to do it this way). Pay attention to directory locations (here and in the other scripts)!
### user setable variables ### ref_dir = '/volumes/hd1/ref_genome_database/ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/' # location of the database directory ### end user setable variables ### import os from Bio import SeqIO with open('all_genomes.fasta', 'w') as fasta_out, open('all_genomes_map.txt', 'w') as map_out: for d in os.listdir(ref_dir): for f in os.listdir(ref_dir + d): if f.endswith('.gbk'): gbk_name = f for record in SeqIO.parse(ref_dir + d + '/' + gbk_name, 'genbank'): parent = record.seq for feature in record.features: if feature.type == 'CDS': start = int(feature.location.start) end = int(feature.location.end) strand = feature.location.strand try: trans = feature.qualifiers['translation'][0] feature_id = (start, end, strand) if strand == -1: sequence = parent[start:end].reverse_complement() else: sequence = parent[start:end] seqname = record.id + '_' + str(start) + '_' + str(end) + '_' + str(strand) print start, end, sequence[0:10] print >> fasta_out, '>' + seqname + '\n' + trans print >> map_out, d, f, seqname except KeyError: continue
Make a DIAMOND database:
diamond makedb --in all_genomes.fasta -d all_genomes_diamond
And run DIAMOND blastx (so fast!). This took me about an hour with 24 cpus and 32 Gb RAM for a metagenome with 22 million reads…
diamond blastx -d all_genomes_diamond -q test_mg.good.fasta -o test_mg.good.diamond.txt -k 1
Now create the Genbank file containing all features that have a hit.
ref_dir = 'ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/' import os from Bio import SeqIO from Bio.SeqRecord import SeqRecord from Bio.Seq import Seq from Bio.Alphabet import IUPAC map_dict = {} with open('all_genomes_map.txt', 'r') as map_in: for line in map_in: line = line.rstrip() line = line.split() map_dict[line[2]] = line[0] hits = set() genomes = set() with open('test_mg.good.diamond.txt', 'r') as diamond: for line in diamond: line = line.rstrip() line = line.split() evalue = float(line[10]) hit = line[1] if evalue < 1e-5: hits.add(hit) print 'hit =', hit genome = map_dict[hit] genomes.add(genome) with open('all_genomes_nr/genetic-elements.dat', 'w') as all_genetic_elements: print >> all_genetic_elements, 'ID' + '\t' + 'all_genomes_nr' print >> all_genetic_elements, 'NAME' + '\t' + 'all_genomes_nr' print >> all_genetic_elements, 'TYPE' + '\t' + ':CHRSM' print >> all_genetic_elements, 'CIRCULAR?' + '\t' + 'N' print >> all_genetic_elements, 'ANNOT-FILE' + '\t' + 'all_genomes_nr.gbk' print >> all_genetic_elements, '//' with open('all_genomes_nr/organism-params.dat', 'w') as all_organism_params: print >> all_organism_params, 'ID' + '\t' + 'all_genomes' print >> all_organism_params, 'Storage' + '\t' + 'File' print >> all_organism_params, 'Name' + '\t' + 'all_genomes' print >> all_organism_params, 'Rank' + '\t' + 'Strain' print >> all_organism_params, 'Domain' + '\t' + 'TAX-2' print >> all_organism_params, 'Create?' + '\t' + 't' good_features = [] ec = set() products = set() for d in os.listdir(ref_dir): if d in genomes: for f in os.listdir(ref_dir + d): for record in SeqIO.parse(ref_dir + d + '/' + f, 'genbank'): for feature in record.features: keep = False if feature.type == 'CDS': try: temp_ec = feature.qualifiers['EC_number'] for each in temp_ec: if each not in ec: keep = True ec.add(each) print feature.type, each if keep == True: good_features.append(feature) except KeyError: try: temp_product = feature.qualifiers['product'][0] if temp_product not in products: good_features.append(feature) products.add(temp_product) print feature.type, temp_product except KeyError: continue new_record = SeqRecord(Seq('nnnn', alphabet = IUPAC.ambiguous_dna), id = 'all_genomes_nr', name = 'all_genomes_nr', features = good_features) SeqIO.write(new_record, open('all_genomes_nr/all_genomes_nr.gbk', 'w'), 'genbank')
Once you’ve waded your way through that mess you can run the Pathos program in Pathway-Tools from the command line as such:
pathway-tools -lisp -patho all_genomes_nr/ -disable-metadata-saving &> pathos_all_genomes_nr.log
That will take a little while to run. When it’s complete you can navigate to where Pathway-Tools store your user-generated pathways (probably something like home/you/ptools-local/pgdbs/user). The pathways will be in the all_genomes_nycyc directory. The best way to get at that information is to parse the friendly pathway-reports.txt file.
Update 6/5/15 – I was stuck in file parsing mode when I wrote that last paragraph. Clearly the best way to “get at that information” is to fire up Pathway-Tools, which will let you view all the pathways by hierarchy, and their associated graphics.
Can you execute Pathologic PGDB reconstruction from gbk file using only a console command?
Yes you can, see the last command in this post.
Thanks.
I have another question if you don’t mind.
Do you know what do this PathwayTools flags do:
-disable-metadata-saving
-nologfile
And are there a guide with all the command of PathwayTools? Do you know?
Yes, the best guide is in the manual that is included with the pathway-tools installation.
Do you have any links to the files you produce here? More specifically, what is the format of the file that you give to pathway-tools?
Brandon,
No links, and the format of the files are simply those specified by the pathway-tools module pathologic. To make this easier I’ve built the method into the paprica pipeline that I’ve been working on. Use the paprica-mg_build.py and paprica-mg_run.py scripts available from the development version of paprica on Github. Note that this requires you to build the paprica database first, which takes some dependencies and time. If you don’t want to do that you can deconstruct the paprica-mg.py script, but I’d recommend building the database. Be aware that at this time there is a bug in pathway-tools that prevents PGDBs being built in parallel. I think the pathway-tools people are working on this and hopefully there will be a fix out very soon. Until this is fixed it takes a very long time to build the paprica database.
Jeff,
Thanks for the detailed answer! I’ll read through the publication’s M&M, which admittedly I haven’t done yet.
Cheers,
Brandon
I have not, but I’ll check it out. Looks interesting from the paper. I like having a modular approach and being able to use the MetaCyc ontology – otherwise this looks like what I’ve been missing! http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002358
Hope sounds interesting to you! greetings.
Have you tried HUMAnN ? http://huttenhower.sph.harvard.edu/humann