Mega megablast – conducting a massively parallel blast

The blast algorithm is, with good reason, one of the most widely used tools in bioinformatics.  It allows a reasonably quick comparison between a set of sequences and a (potentially very large) database, or against a set of sequences in a fasta file (slower than using a formatted database).  For very small datasets the NCBI maintains an online blast interface; for datasets with hundreds to thousands of sequences it is best to install the blast executables and databases locally.  A wide selection of pre-formatted databases is available on the NCBI ftp site.  But what about using blast for metagenomic analysis?  The relatively small metagenome I'm working on has 33,000,000 sequence reads after quality control.  Blast is fast considering the vast number of calculations required to compare all of those reads against even a modestly sized database, but it would still take over a week to run this analysis using blastn (the simplest flavor of blast) against the NCBI nt (nucleotide) database on a single computation node with 12 cpus.  There is a parallel version of blast that uses the MPI libraries to split the analysis across multiple nodes, but compiling parallel blast looks a little daunting (I haven't tried yet).  Fortunately there is a better solution available to users of the UW's Hyak cluster and similar systems.

On Hyak, individual investigators, research groups, or academic units purchase computation nodes for three-year deployments.  Because the computational needs of our group are limited (I'm currently the only one doing bioinformatics) we don't own any nodes.  Fortunately the College of the Environment owns 11 nodes for trial use by investigators who might purchase nodes of their own.  They also allow these nodes to be used by students in research groups that don't have a long-term need for nodes, or for whom purchasing nodes isn't economically feasible (thank you, COE!).  This means that I have access to eleven 12-processor nodes.  As a freeloader on the system I try to tread lightly on these resources, and mostly succeed.  To blast my metagenome on this system I would need to tie up all of the COE nodes for days, which, in addition to being rude, would certainly jeopardize my continued use of Hyak.

For this kind of analysis the real power of Hyak comes from access to the backfill queue.  The backfill queue allows users to submit any number of jobs to any idle node on Hyak, whether they own the node or not.  This doesn't help with memory-intensive operations like assembly, but it works great for computationally intensive operations like blast.  The basic idea is to split my fasta file of 33,000,000 reads into smaller files that will each blast in less than four hours (a requirement for use of the backfill queue).  A little testing shows that 330,000 reads will blast, using megablast against the NCBI nt database and all 12 cpus available on a single node, in about an hour and a half.  Great!  I split my fasta into 100 pieces using (the script can be found here):

python split_large_multifasta.py 44_trimmed_combined_h20.fasta 100
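
I haven't reproduced the linked splitting script here, but a minimal sketch of the same idea (assuming, as the submission script below does, that the second argument is the number of pieces and that the output files land in a split_fasta/ directory named <input>_<i>.fasta) would look something like this:

import os
import sys

infile = sys.argv[1]            # e.g. 44_trimmed_combined_h20.fasta
n = int(sys.argv[2])            # number of pieces, e.g. 100

outdir = 'split_fasta'          # the job scripts below copy chunks from here
if not os.path.exists(outdir):
    os.makedirs(outdir)

# One output handle per piece; reads are dealt out round-robin by header line.
outputs = [open(os.path.join(outdir, '%s_%i.fasta' % (infile, i)), 'w')
           for i in range(n)]

read = 0
with open(infile) as fasta:
    for line in fasta:
        if line.startswith('>'):    # each '>' header starts a new read
            read += 1
        outputs[read % n].write(line)

for handle in outputs:
    handle.close()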

Then I use a second Python script to create a shell script for each file, and submit that shell script to the backfill queue:

import subprocess

# Write and submit one job script per fasta chunk.
for i in range(0,101):
    i = str(i)
    output = open(i+'.sh', 'w')
    # PBS directives: job name, working directory, one full 12-core node,
    # and the four hour walltime required by the backfill queue.
    print >> output, '#!/bin/bash'
    print >> output, '#PBS -N "'+i+'"'
    print >> output, '#PBS -d /gscratch/coenv/bowmanjs/barrow_metagenome_working/large_blast'
    print >> output, '#PBS -l nodes=1:ppn=12,feature=12core,walltime=4:00:00'
    print >> output, '#PBS -M bowmanjs@u.washington.edu'
    print >> output, '#PBS -m abe'
    # Copy the query chunk into the database directory, blast it there,
    # then move the xml output back out and delete the copy.
    print >> output, 'cp split_fasta/44_trimmed_combined_h20.fasta_'+i+'.fasta /gscratch/coenv/bowmanjs/databases/'
    print >> output, 'cd /gscratch/coenv/bowmanjs/databases'
    print >> output, 'blastn -task megablast -max_target_seqs 5 -evalue 1e-5 -db nt -outfmt 5 -out '+i+'.xml -query 44_trimmed_combined_h20.fasta_'+i+'.fasta -num_threads 12'
    print >> output, 'mv '+i+'.xml ../barrow_metagenome_working/large_blast/'
    print >> output, 'rm 44_trimmed_combined_h20.fasta_'+i+'.fasta'
    output.close()
    # Make the job script executable and submit it to the backfill (bf) queue.
    subprocess.call('chmod a+x '+i+'.sh', shell = True)
    subprocess.call('qsub -q bf '+i+'.sh', shell = True)

Notice that there's a funny bit in the above script where the query fasta file is copied to the directory containing the database, then deleted after the analysis.  This is because getting blast to look for the database anywhere other than the working directory is quite difficult (one bad quirk of blast; supposedly you can fix it in your blast profile, but that has never worked for me).  I find it easier to just work in the directory where I stash all my database files and then move everything out afterward.  This isn't optimal, but it works.  At any rate, executing the above script creates and submits 100 jobs to the backfill queue.  There are some minor delays while the queuing system sorts itself out and while waiting for idle nodes, but the whole thing is complete in about four hours.  Combine all the xml outputs with:

cat *.xml > combined.xml
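
One caveat: backfill jobs can be bumped by node owners or killed at the walltime limit, so before trusting the combined file it's worth confirming that every chunk actually produced output.  A minimal check, assuming the 0.xml through 100.xml naming from the submission script and run from the directory the outputs were moved to:

import os

missing = [i for i in range(0, 101) if not os.path.exists(str(i) + '.xml')]
print('missing chunks: %s' % missing)

Any chunk that turns up missing can simply be resubmitted with qsub.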

Now the bad part.  At this point you have a behemoth xml file with all of your data, 40+ gigabytes in my case.  The bits you want can be pulled out with a new script, which I'll detail in a separate post.
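
That script isn't shown here, but to sketch the kind of extraction involved: assuming all you want is each read's id, its top hit, and the e-value, something like the following works.  Since the concatenated file isn't, strictly speaking, a single well-formed XML document, this version walks the per-chunk files instead and streams each one with iterparse, so the 40+ gigabytes never sit in memory at once.

import os
import xml.etree.ElementTree as ET

out = open('blast_top_hits.txt', 'w')             # hypothetical output name

for i in range(0, 101):
    xml_file = str(i) + '.xml'
    if not os.path.exists(xml_file):
        continue
    for event, elem in ET.iterparse(xml_file):
        if elem.tag == 'Iteration':               # one Iteration per query read
            query = elem.findtext('Iteration_query-def')
            hit = elem.find('Iteration_hits/Hit') # first hit reported (typically the best)
            if hit is not None:
                out.write('\t'.join([query,
                                     hit.findtext('Hit_def'),
                                     hit.findtext('Hit_hsps/Hsp/Hsp_evalue')]) + '\n')
            elem.clear()                          # discard the parsed read

out.close()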
