Parsing blast xml output

In the previous post I described a method for efficiently blasting a medium-large dataset (33 million sequences).  One down side to using blast on a dataset this size is the very large size of the output xml file, in this case over 40 Gbytes.  The size of the output can be reduced by specifying the table output format (-outfmt 6) when executing blast, but this greatly reduces the amount of information you receive from the run.  If you are going to the trouble of running blast on 33 million sequences you may as well retain all the information possible from each alignment, reducing the chance of needing to blast the dataset again!  The xml format (-outfmt 5) is a (reasonably) user friendly and rich blast output format.  If you are new to bioinformatics however it can be a little daunting – the critical information (best hit, e value, etc.) can look pretty buried in a mysterious jumble of tabs and tags.  Here’s a chunk of the aforementioned xml output, corresponding to a single search:

  <BlastOutput_iterations>
      <Iteration_iter-num>1
      <Iteration_query-ID>Query_1
      <Iteration_query-def>HWI-ST1035:124:D1H8TACXX:2:1101:1236:2147
      <Iteration_query-len>70
      <Iteration_hits>
      <Iteration_stat>
          <Statistics_db-num>17771239
          <Statistics_db-len>44780205193
          <Statistics_hsp-len>30
          <Statistics_eff-space>1769882720920
          <Statistics_kappa>0.46
          <Statistics_lambda>1.28
          <Statistics_entropy>0.85
      <Iteration_message>No hits found

What we have here is some kind of nested format for storing information, identified by tags (<some_tag>), which is all that xml is.  Subordinate pieces of information are more indented than their parents, so in this example all the information belongs to BlastOutput_iterations.  If you are familiar with blast you will recognize some of the tags.  Iteration_query-def for example, gives the name of the query (the cumbersome read id HWI-ST1035:124:D1H8TACXX:2:1101:1236:2147).  The tag Iteration_message is self explanatory, it tells us that this iteration returned no significant matches in the database.  Had there been a match additional tags detailing the match (or matches) would have been included.

Fortunately this use of tags makes it pretty easy to parse blast xml output using regular expressions.  I wrote a short script to convert my massive xml output into a less massive (but still a little cumbersome) tab delimited file from which I can extract specific lines corresponding to a hit using grep.  The basic strategy for scripting xml is to create a variable for each tag of interest and march line by line through the xml, re-saving the variable anytime you encounter the corresponding tag of interest.  When the outermost (least subordinate) tag of interest is encountered (or the last in a known sequence of tags is encountered, as done here), print each variable to a line in an output file.

import re

output = open('test_blast.txt','w')
n = 0
print >> output, 'read'+'\t'+'hit_def'+'\t'+'hit_acc'+'\t'+'e'

with open('blast.xml','r') as xml:
        for line in xml:
            if re.search('<Iteration_query-def>', line) != None:
                n = n + 1
                line = line.strip()
                line = line.rstrip()
                line = line.strip('<Iteration_query-def>')
                line = line.rstrip('</')
                query_def = line
            if re.search('No hits found', line) != None:
                line = line.strip()
                line = line.rstrip()
                line = line.strip('<Iteration_message>')
                line = line.rstrip('</')
                print n
                print >> output, query_def+'\t'+line
            if re.search('<Hit_def>', line) != None:
                line = line.strip()
                line = line.rstrip()
                line = line.strip('<Hit_def>')
                line = line.rstrip('</')
                hit_def = line
            if re.search('<Hit_accession>', line) != None:
                line = line.strip()
                line = line.rstrip()
                line = line.strip('<Hit_accession>')
                line = line.rstrip('</')
                hit_acc = line       
            if re.search('<Hsp_evalue>', line) != None:
                line = line.strip()
                line = line.rstrip()
                line = line.strip('<Hsp_evalue>')
                line = line.rstrip('</')
                e_val = line
                print n
                print >> output, query_def+'\t'+hit_def+'\t'+hit_acc+'\t'+e_val 

output.close()

If, having successfully parsed the xml file, I want to access all the reads from my metagenome with a hit containing “nifH” in the definition I can extract that information relatively quickly from my table using:

grep 'nifH' test_blast.txt

Since the information that grep returns will include the query id, a new fasta containing the reads with a blast hit of “nifH” is just a short script away.  One tip… since most of the reads will not have a blast hit you can create a new, smaller table containing only those reads with hits by identifying lines with an e value.

 grep 'e-' test_blast.txt > small_test_blast.txt
32790 Total Views 4 Views Today
This entry was posted in Research. Bookmark the permalink.

2 Responses to Parsing blast xml output

  1. Ben says:

    Since I am reading this today, some might be tempted to use this in 2023: be careful when using .strip(foo) and .rstrip(bar)

    They do not remove the foo and bar prefixes/suffixes, they would remove from the ends of the string (or only the tail when using rstrip) every character belonging in the set ‘foo’ or ‘bar’.

    Example :
    ”’
    >>> s = “id_seed”
    >>> s.strip(“”).strip(“</")
    'seed'
    '''

    …and the "id_" is lost!

  2. Avatar photo Jeff says:

    I just learned a great trick that makes this a whole lot easier. Python has a module (gzip) that allows gzipped files to be read as text files, e.g.

    import gzip
    with gzip.open('really_big_file.xml.gz', 'rb') as xml:
       for line in xml:
          do some stuff

    This allows massive blast output files to stay compressed at about 1/5 their inflated size. I’ve placed an updated script here. The updated script also very niftily creates a fasta of hits, as aa seqs if blastx was used.

Leave a Reply

Your email address will not be published. Required fields are marked *

WordPress Anti Spam by WP-SpamShield