Sunday, May 18, 2014

Restricting a BLAST search to particular taxa or type of sequences

So the other day we got worried that we may have some contamination in our raw data of the fungal genome that we are trying to assembly. In theory nothing should be there but the fungus... but what if there are, say, some bacteria sequences that shouldn't be there x_x!??

We decided to do a BLAST search with some unitigs* that were assembled by the software ABySS (Simpson et al. 2009), restricting the hits to Bacteria sequences. To accomplish this, I used BLAST+ which surprisingly has a rather dense and not-easy-for-the-uninitiated documentation. Here are some starters:

http://www.bioinformaticsbox.com/blast-blastall-tutorial/
ftp://ftp.ncbi.nih.gov/blast/documents/formatdb.html
http://www.bioperl.org/wiki/HOWTO:PhylogeneticAnalysisPipeline#All_vs_All
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node1.html

For the current task, I ended up using the flag "-l" (lower case L) that "Restricts the search of a database to the subset specified by GIs in the input file" according to:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node53.html

So you run it like this:
blastall -p blastn -d nt -i query.fasta -o output_blast.txt -a 4 -F 'm D' -m 8 -e 1e-5 -l restriction_file.txt

where:
nt -- is the NCBI database that is already in the cluster that I'm using
-a -- are the number of processors used
-F -- Specifies filter(s) to be used to mask query sequence**. In this case, 'm S' for proteins, 'm D' for nucleotides.
-m -- The output format, in this case it's given in Tabular file (8)
-e -- Specifies Expectation value cutoff
-l -- Restricts the search of a database to the subset specified by GIs in the input file

To get my GI file I just went to Genbank (http://www.ncbi.nlm.nih.gov/genbank/), I typed "Bacteria" in the search box with "Nucleotide" selected. It gave a ton of sequences..from which the first one is an Arabidopsis (wtf?). However, in the right section of "Filter your results" there is a possibility of clicking "Bacteria (11277160)" which is what we wanted in the first place. I clicked on it, and then I clicked on "Send to" --> "File" --> Format "GI List" --> "Create file" and voilĂ !

The file looks like this:

528171613
528816716
374352002
378958130
238801497
.
.
.

The logic should apply all the same if you want particular genes instead of species. Hope this helps somebody :)


* unitig is a contig formed from overlapping unambiguously unique sequences, see also http://helicoverpa.org/glossary/term/2
** uh?... check http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node80.html

Simpson et al. 2009 ABySS: A parallel assembler for short read sequence data, Genome Research 19(6):1117-23