******************
Updated method posted here: https://www.polarmicrobes.org/?p=859
******************
But they’re not. Getting meaningful taxonomy from blast results is one of those things. Here’s the situation: I ran a blastn search locally against the NCBI nt (nucleotide) database, saving the output in the (somewhat) easy to manipulate xml format. I’m interested in catching as many of the close homologues to my queries as possible, so I allowed up to 500 matches per search. The resulting xml file is pretty substantial. What I want to do now is identify the genus for each of my hits. It would be absolutely fantastic if NCBI would include the taxonomic information in the blast results – it would be easy to do on their end – but they don’t. C’est la vie.
Solution the first (the ugly)
Since the hit definitions (the hit_def tag in the xml) contain the scientific name of the species of origin, they can be parsed with regular expressions to produce something that is probably, most of the time, the genus name.
import re

with open('big_blast.xml', 'r') as xml:
    for line in xml:
        if re.search('<Hit_def>', line) != None:
            line = line.strip()
            line = re.sub('<Hit_def>', '', line)
            line = re.sub('</Hit_def>', '', line)
            hit_def = line.lower()
            hit_genus = re.split(' ', hit_def)[0]
But this will on occasion produce some weirdness that is not the genus name, and I doubt the exact format of hit_def is stable over time. It would be nice to tap into the NCBI taxonomy database to get this information right from the source.
Solution the second (the fat)
The NCBI taxonomy database can be acquired from the NCBI ftp site at ftp://ftp.ncbi.nih.gov/pub/taxonomy/. The large gi_taxid_nucl.dmp.gz file is the key to the NCBI taxonomy safe. It is a simple tab delimited text file mapping a gi number to a taxonomy id number. With the taxonomy id number (and a little voodoo) you can obtain a full taxonomy for any gi that has one. Since the gi number is conveniently embedded in the hit id section of the xml (hit_id tag, pry it out with some more regular expressions) the full taxonomy seems so, so close…
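For illustration, here is roughly what that prying looks like. This is a sketch only; it assumes the hit_id fields follow the usual gi|123456|accession| layout, which may not hold for every record in your xml.

import re

# Pull the gi numbers out of the hit_id lines.  Assumes the common
# "gi|123456|gb|ABC123.1|" format; anything that deviates is skipped.
gi_list = []
with open('big_blast.xml', 'r') as xml:
    for line in xml:
        if re.search('<Hit_id>', line) != None:
            match = re.search(r'gi\|(\d+)\|', line)
            if match != None:
                gi_list.append(match.group(1))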
A casual user of Python might look at gi_taxid_nucl.dmp (with head, after using gunzip) and think dictionary. If the whole thing can be crammed into a dictionary with the gi numbers as keys, the taxid can be easily obtained for any given gi number. Problem is, it can’t be crammed into a dictionary – at least without a lot of trouble. Python dictionary creation and operation is memory intensive. 24 gigs of RAM is not nearly enough for this. I did manage to create a dictionary on a high memory machine (150 gigs, it failed when I initially set ulimit -v to 100 gigs) by iterating across each line in gi_taxid_nucl.dmp, saving the result as a pickle (a what you say? That’s what I said too. Turns out that pickles, in addition to being tasty fermented cucumbers, are binary flat files. These files can store things like Python lists and dictionaries), but the resulting pickle couldn’t be utilized on the 24 gig machine. I’m a guest on the high memory cluster and am reluctant to hog it for what seems like a simple operation, so I needed a different solution…
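For the record, the dictionary-and-pickle attempt looked roughly like this. It is a sketch, not a recommendation; the pickle file name is just for illustration, and on the full gi_taxid_nucl.dmp it needs a very high memory machine.

import cPickle as pickle

# Brute force: cram the whole gi-to-taxid map into a dictionary, then pickle it.
# On the full gi_taxid_nucl.dmp this needs on the order of 100+ gigs of RAM.
gi_taxid = {}
with open('gi_taxid_nucl.dmp', 'r') as map_file:
    for line in map_file:
        line = line.split()
        gi_taxid[line[0]] = line[1]

with open('gi_taxid_nucl.p', 'wb') as pickle_out:
    pickle.dump(gi_taxid, pickle_out)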
Solution the third (the slow)
I don’t have a lot of experience with SQL databases, but I know that this is the intended environment for the NCBI taxonomy. The SQL workhorse MySQL looked like overkill, so I went with SQLite. Python has a nice built-in interface (the sqlite3 module) for working with SQLite databases, so this seemed like a natural way to go. I created a database for gi_taxid_nucl.dmp as such:
import sqlite3

conn = sqlite3.connect('gi_taxid_nucl.db')
c = conn.cursor()
c.execute('''CREATE TABLE gi_taxid (gi, taxid)''')

with open('gi_taxid_nucl.dmp', 'r') as map_file:
    for line in map_file:
        line = line.split()
        print line[0], line[1]
        c.execute("INSERT INTO gi_taxid VALUES ("+line[0]+","+line[1]+")")

conn.commit()
conn.close()
The SQL database isn’t much larger than the original tab-delimited file, and didn’t take too much memory to build. It completed in a couple of hours. To get data back out of the database I tried:
import sqlite3

conn = sqlite3.connect('gi_taxid_nucl.db')
c = conn.cursor()

t = (123,)
for row in c.execute('SELECT * FROM gi_taxid WHERE gi=?', t):
    print row
That should pull out the row corresponding to the gi number 123. And it does. But it takes a long time. I didn’t time it, but I’m guessing it would take a few weeks to run tens of thousands of queries. I’m not sure if there is a faster way to work with sqlite. It feels like an SQL database should be the solution, but it hasn’t worked out yet…
Solution the fourth (the least ugly, fat, and slow)
The fastest way to parse gi_taxid_nucl.dmp (that I can come up with) is with grep. It takes grep only a second or two to match a gi number to a line in gi_taxid_nucl.dmp, which is pretty amazing. My final script is below. I iterate across a list of gi numbers obtained from my blast output (reformatted into a table, see this post https://www.polarmicrobes.org/?p=753), using grep to match each to a line in gi_taxid_nucl.dmp. From these lines I can create a nice, reasonably sized dictionary from which I can quickly obtain the taxid when I need it. The part of the script that actually generates the taxonomy comes from this post: https://groups.google.com/forum/#!msg/qiime-forum/PXem7gMltao/oKOxJ_zFD4cJ.
from cogent.parse.ncbi_taxonomy import NcbiTaxonomyFromFiles
import subprocess

subprocess.call('rm blast_taxid.txt', shell = True)

print 'building dictionary'

n = 0
with open('blast_output_table.txt', 'r') as blast_table:
    for each in blast_table:
        n = n + 1
        print n
        each = each.split('\t')
        subprocess.call('grep --max-count=1 \"'+each[3]+'\" gi_taxid_nucl.dmp | tee -a blast_taxid.txt', shell = True)

map_dict = {}
with open('blast_taxid.txt') as gi_taxid:
    for each in gi_taxid:
        each = each.split()
        gi = each[0]
        taxid = each[1]
        map_dict[gi] = taxid
        print gi, map_dict[gi]

print 'reading in files'

tree = NcbiTaxonomyFromFiles(open('nodes.dmp'), open('names.dmp'))
root = tree.Root

print 'generating lineages'

def get_lineage(node, my_ranks):
    ranks_lookup = dict([(r,idx) for idx,r in enumerate(my_ranks)])
    lineage = [None] * len(my_ranks)
    curr = node
    while curr.Parent is not None:
        if curr.Rank in ranks_lookup:
            lineage[ranks_lookup[curr.Rank]] = curr.Name
        curr = curr.Parent
    return lineage

ranks = ['genus']

output = open('blast_lineage.txt', 'w')
error = open('error.txt', 'w')

with open('blast_output_table.txt', 'r') as input:
    for line in input:
        if line[0:5] != 'query':
            try:
                line = line.split('\t')
                taxid = int(map_dict[line[3]])
                node = tree.ById[taxid]
                tax = get_lineage(node, ranks)
                tax = str(tax[0]).lower()
                print >> output, line[0]+'\t'+tax+'\t'+line[2]+'\t'+line[5],
                print line[0], tax.lower()
            except KeyError:
                print 'key error', line[3]

output.close()
error.close()
I watched a show last night featuring researchers who get to dart polar bears from a helicopter, then land and give them medical examinations. It looked like fun, I wonder if they want to switch dissertation topics…
If you are interested in setting up a taxonomy database please read this post describing a better method: https://www.polarmicrobes.org/?p=859
I’ve also found this post very helpful in setting up a taxonomy database. Thanks for putting me on the right path.
I want to let you know one adjustment I made in order to speed up the queries. When creating the database, I adjusted your script above slightly:
c.execute('''CREATE TABLE gi_taxid (
gi INTEGER PRIMARY KEY,
taxid INTEGER)''')
This makes gi a primary key and indexes the field, greatly speeding up searches.
sfederman,
I would like to implement your suggestion to (hopefully) speed up the process further, but I don’t understand exactly how to implement it. I am not used to c-code. Could you show where and how you would implement it in Jeff’s code!?
Thanks in advance!
Mialian,
That’s not C; sfederman is still in Python at that point. If you replace the c.execute line in the Option #3 code with the one suggested above it should work. Alternatively, try this method instead: https://www.polarmicrobes.org/?p=859.
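In context, the database creation from Option #3 would look something like this with that change dropped in:

import sqlite3

conn = sqlite3.connect('gi_taxid_nucl.db')
c = conn.cursor()

# sfederman's suggestion: make gi the primary key so lookups use an index
c.execute('''CREATE TABLE gi_taxid (
    gi INTEGER PRIMARY KEY,
    taxid INTEGER)''')

with open('gi_taxid_nucl.dmp', 'r') as map_file:
    for line in map_file:
        line = line.split()
        c.execute("INSERT INTO gi_taxid VALUES ("+line[0]+","+line[1]+")")

conn.commit()
conn.close()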
Glad it helped! I’ve made the following two modifications to my use of grep, now using “zgrep -F”. zgrep allows searching the gzipped dmp file (dmp.gz) directly, which saves a little disk space. The -F parameter (same as using fgrep on a non-gzipped file) does not interpret regular expressions (not needed here), speeding up the search.
Further modifications… if you use the above fgrep --max-count=1 method you will invariably incorrectly match some lines in the .dmp file (wherever your query gi appears inside a longer target gi). Since --max-count=1 aborts the search after one match you will often not return the line you want. Use (z)grep -P “23435\t” instead. The -P option uses Perl regex, allowing the use of the \t character. This example would match:
23435 46
123435 893
but not:
234357 21456
This effectively reduces the .dmp file to a size that allows dictionary creation, but it will contain a number of erroneous matches (not a problem, because the dictionary will force exact matches downstream).
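In the script above that means swapping the grep call for something along these lines (a sketch, assuming your grep supports -P and the dmp file has been left gzipped):

# Search the gzipped dmp with a Perl-style regex so the gi must be followed by
# a tab, avoiding partial matches on longer gi numbers.
subprocess.call('zgrep -P \"'+each[3]+'\\t\" gi_taxid_nucl.dmp.gz | tee -a blast_taxid.txt', shell = True)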
Thank you so much for posting this. I’ve been struggling with the same problem for a couple of weeks now and it never occurred to me to grep for it!