We decided to do a BLAST search with some unitigs* that were assembled by the software ABySS (Simpson et al. 2009), restricting the hits to Bacteria sequences. To accomplish this, I used BLAST+ which surprisingly has a rather dense and not-easy-for-the-uninitiated documentation. Here are some starters:
http://www.bioinformaticsbox.com/blast-blastall-tutorial/
ftp://ftp.ncbi.nih.gov/blast/documents/formatdb.html
http://www.bioperl.org/wiki/HOWTO:PhylogeneticAnalysisPipeline#All_vs_All
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node1.html
For the current task, I ended up using the flag "-l" (lower case L) that "Restricts the search of a database to the subset specified by GIs in the input file" according to:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node53.html
So you run it like this:
blastall -p blastn -d nt -i query.fasta -o output_blast.txt -a 4 -F 'm D' -m 8 -e 1e-5 -l restriction_file.txt
where:nt -- is the NCBI database that is already in the cluster that I'm using
-a -- are the number of processors used
-F -- Specifies filter(s) to be used to mask query sequence**. In this case, 'm S' for proteins, 'm D' for nucleotides.
-m -- The output format, in this case it's given in Tabular file (8)
-e -- Specifies Expectation value cutoff
-l -- Restricts the search of a database to the subset specified by GIs in the input file
To get my GI file I just went to Genbank (http://www.ncbi.nlm.nih.gov/genbank/), I typed "Bacteria" in the search box with "Nucleotide" selected. It gave a ton of sequences..from which the first one is an Arabidopsis (wtf?). However, in the right section of "Filter your results" there is a possibility of clicking "Bacteria (11277160)" which is what we wanted in the first place. I clicked on it, and then I clicked on "Send to" --> "File" --> Format "GI List" --> "Create file" and voilĂ !
The file looks like this:
528171613
528816716
374352002
378958130
238801497
.
.
.
The logic should apply all the same if you want particular genes instead of species. Hope this helps somebody :)528816716
374352002
378958130
238801497
.
.
.
* unitig is a contig formed from overlapping unambiguously unique sequences, see also http://helicoverpa.org/glossary/term/2
** uh?... check http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node80.html
Simpson et al. 2009 ABySS: A parallel assembler for short read sequence data, Genome Research 19(6):1117-23
No comments:
Post a Comment