Somethings should be easy, and they are…

**UPDATE**
Got it.  You have to set the BLASTDB variable for this to work and place the taxonomy files in this same location.  I just did it in my .bash_profile:

BLASTDB=”/path/to/databases”
export BLASTDB

**UPDATE**
The original motivation for this was to obtain taxonomy data for blast search results.  Sometime in the last few months this feature actually slipped into the BLAST+ package.  I can’t actually get the feature to work, but perhaps you can!  It requires the installation of two special taxonomy files found at ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz.  Typically cryptic instructions can be found on the command line blast manual page at http://www.ncbi.nlm.nih.gov/books/NBK1763/, along with the necessary flags to get the taxonomy to appear (ha!) in table format output.  I suspect that my inability to get this to work stems from my dislike for mucking with the blast profile file.  Instead of setting a default database location I use the complete path in my calls.  Regardless of where I locate the taxonomy database blast can’t seem to find it 🙁

****

A couple of months ago I wrote a post on the difficulty of parsing the gi_taxid_nucl.dmp.gz file that maps sequence gi numbers to their NCBI taxonomy number (taxid).  With a taxid it is possible to obtain the actual taxonomy of (some) sequences in Genbank.  I proposed a couple of elaborate solutions, some of which were improved upon in the comments section by other contributors.  One particularly good suggestion was to use primary keys in a SQL database for fast look-ups.

In my latest analysis I have ~ 50,000 gi numbers to map to taxids and had to revisit this issue again.  After some tinkering I realized I’d missed a very obvious and reasonably efficient way to extract the relevant lines from the gi_taxid_nucl/prot.dmp.gz file.  If you have only a few gi numbers to map stick with the SQL or grep solutions.  If you have many give this method a try.  Making use of Python’s with statement it is possible to very quickly search each line of the .dmp.gz against a known collection of gi numbers contained in a set.  I only recently discovered Python sets, and they have already saved me numerous times.  Searching against items in a set is orders of magnitude faster than searching against items in a list.

# assuming that gis is a set of gi numbers...
import gzip
gi_taxid_output = open('small_gi_taxid.txt', 'w')
with gzip.open('gi_taxid_prot.dmp.gz', 'r') as dmp:
    found = 0
    for line in dmp:
        line_split = line.split('\t')
        print line_split[0], found
        if str(line_split[0]) in gis:
            found = found + 1
            print >> gi_taxid_output, line

Using this method a search for 50,000 gi numbers takes ~ 1 hour for gi_taxid_nucl.dmp.gz and about 30 minutes for gi_taxid_prot.dmp.gz.  The grep method, even utilizing the gnu parallel command for an 8 fold speed increase (on an 8 core machine), would have taken at least overnight and produced a messy gi_taxid file with many erroneous hits.

19963 Total Views 9 Views Today
This entry was posted in Research. Bookmark the permalink.

6 Responses to Somethings should be easy, and they are…

  1. Audrey Bourret says:

    Thank you so much, this post allowed me to finally get taxonomy on my blast results …

  2. Lavinia says:

    Thanks Jeff, I was just about to wade through the BLAST manual to see how to implement this, Google led me here – exactly what I was after, thanks.

  3. tallnutt says:

    Hi,

    I’ve had the same problem for a couple of years now, and still not really found the solution. In metagenomics, methods seem to be moving toward using online services like MGRAST, who presumably have their own taxonomy databases. However, I still like to be able to run my own analyses locally. The quickest way I’ve found to do this is to use a binary search of the taxonomy db in python, see code below. I can’t take credit for the algorithm, whih I got from the net somewhere, but sorry its from too many places to remember them all to give proper credit.. This retrieves the tax_id very quickly indeed.. but the slow part is then getting the scientific name from Entrez, which I have only managed to do via their server one at a time (maybe this could be speeded up doing several at a time, but I got errors trying that and gave up so far). Anyway, perhaps the code below will be useful. Theo.

    from Bio import Entrez
    import sys
    import os

    #taxonfetch.py theo allnutt 2014. Gets the taxonomy ID number from a local copy of the NCBI GI/taxonomy list file #and retrieves the scientific name field from entrez. ID and name are then appended to the end of the submitted #blast output (tab format, 6) file.

    Entrez.email = ‘theo.allnutt@csiro.au’
    inputfile = sys.argv[1] #tab format blast output (6) GIs in second column
    outputfile = sys.argv[2] #taxa are added to end columns – name of appended file
    taxdb = sys.argv[3] #specify “p” or “n” for protein or nucleotide taxon list

    #usage e.g. taxonfetch.py infileblast.tab outfile.tab p

    f = open(inputfile,’r’)
    g = open(outputfile,’w’)

    c=0
    dataout = “”
    t=0
    #get the GIs
    gi=[]
    dataout=[]
    ids=[]

    for line2 in f:
    c=c+1
    gi.append(line2.split(‘|’)[1])
    print str(c)+” GIs”
    #print gi
    c=0
    gitax=[]

    names =[]
    t=0

    #CHANGE THESE PATHS TO THOSE FOR YOUR GI – TAX_ID FILES!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    if taxdb == “p”:
    print “reading protein taxonomy \n”
    taxaid = “/OSM/HOME-MEL/all29c/data/db/gi_taxid_prot.dmp”
    if taxdb == “n”:
    print “reading nucleotide taxonomy”
    taxaid = “/OSM/HOME-MEL/all29c/data/db/gi_taxid_nucl.dmp”

    print “searching..”
    size = long(os.path.getsize(taxaid))

    print “TaxID File Size: “, size
    t = open(taxaid,’r’)

    for gis in gi: #binary search #########################################
    c=c+1
    #print gis
    found=False
    offset=0
    chunk=size
    pos=chunk/2
    while found == False and chunk>0:
    #print “posn: “, pos
    chunk = chunk/2
    #print “chunk:”, chunk
    #print “offset: “, offset
    t.seek(pos)
    t.readline()
    entry = t.readline().split(“\t”)
    if entry[0]:
    filegi = int(entry[0])
    filetax = int(entry[1].rstrip(“\n”))

    #print gis, filegi, filetax
    #raw_input()

    if filegi == int(gis):
    answer = filetax
    #print filegi, ” FOUND”
    found = True
    #print ‘chunk:’,chunk,’ offset:’,offset,’ posn:’,pos
    print c,”:”,filegi, answer
    elif filegi > int(gis):
    pos = offset +(chunk/2)

    elif filegi < int(gis):
    offset = offset+chunk
    pos = pos + (chunk/2)

    if found == False:
    answer = "00000"
    print c,":",filegi, answer, "not found"

    gitax.append(str(answer))

    #print gitax #list of Gi's taxids
    cc1=0
    for ids in gitax:
    cc1=cc1+1
    if ids =="00000":
    names.append("No Taxon"+"\t"+ids)
    print ids,"No Taxon"
    else:
    try:
    entryData = Entrez.efetch(id = ids, db = "Taxonomy", retmode = "xml")
    data1=Entrez.read(entryData) #list of dictionaries of stunning complexity – only one entry [0] required
    names.append(data1[0]["ScientificName"]+"\t"+data1[0]["TaxId"]) #fetch only one at a time?
    print cc1,ids, "retrieved",data1[0]["ScientificName"]
    except:
    names.append("not retrieved"+"\t"+ids)
    print ids," not retrieved"
    #add data to inputfile
    n=-1
    f.seek(0)
    for line4 in f:
    n=n+1
    print n
    print line4.rstrip('\n')+"\t"+names[n]+"\n"
    g.write(line4.rstrip('\n')+"\t"+names[n]+"\n")

    g.close()
    f.close()

  4. bfreed05 says:

    Hello Jeff,
    I’m using the taxdb system myself to characterize some environmental contigs I’ve collected. I’m using a linux machine for my first time for this, and have run into the same problem where the taxdb file is out of path.
    What is the BLASTDB variable? My set up goes as follows: ~/ncbi-blast-2.2.29+/db is the folder containing my taxdb.btd and .bti file. The actual blast program runs in ~/ncbi-blast-2.2.29+/bin. Both locations are added to my PATH, but it doesn’t recognize the taxon files and leaves a NA when i request “scientific names” in my output format. taxid displays correctly. Any help is greatly appreciated.

    • bfreed05 says:

      Finally figured it out. My home directory (~/) needed a file titled, ” .ncbirc” with no space or letters before the period. I had to check “display hidden files” in order to see it once it was created. In this .ncbirc file, I created wrote the following text: ” [BLAST] BLASTDB=~/ncbi-blast-2.2.29+/db/” and saved. After that, everything worked normally.

      • Avatar photo Jeff says:

        Yep, that should do it! I chose to do it in my .bash_profile but I believe the end result is the same. Down toward the end of my bash profile I have:

        BLASTDB="/where/I/keep/databases"
        export BLASTDB

        Glad it’s working for you.

Leave a Reply

Your email address will not be published. Required fields are marked *

WordPress Anti Spam by WP-SpamShield