Evo-Improvising: 2014

Tuesday, October 7, 2014

How to use reads2snp with ALR files

Now that I'm back in Montpellier, I remembered that I wanted to write something about how to use an interesting program called reads2snp (Tsagkogeorga et al. 2012; Gayral et al. 2013, both open access :D)*.

"reads2snp is a SNP and genotype caller: it predicts the genotype of distinct individuals at distinct positions of a set of sequences based on read mapping / read counts."

The program was developed in the framework of the project PopPhyl, "in which transcriptome data are used to characterize the population genomics of non-model organisms." Unlike other programs, reads2snp makes no prior hypothesis regarding the population frequency of alleles, and it deals with biases in gene expression and hidden paralogy.

You can download it from PopPhyl's website here (at the time of writing, the available version was 2.0 march 2014). Once you unzip the file you already have the binaries for linux (there is also something for mac, but I won't be concerned with it, I guess it's similar). You can directly put the binary in your PATH, or use it as it is:

$ ./reads2snp_2.0.64.bin -v
reads2snp version = 2.0 march 2014

which prints the version. In the following I will assume the binary is in your PATH.

To see the options of the program, just type the name of the program, that I renamed to be just reads2snp:

$ reads2snp

Actually all you need to know is already in the README.md file that comes with the distribution. Unlike the older versions, as input you can give either BAM files or the so-called ALR files, which is a non-standard format explained at the end of the README.md. Because I need the ALR files in particular (the format has some information that I want), I used another program called bam2alr that you can get here.

To transform a single BAM file into an ALR (for just one individual)

$ bam2alr -r genome.fas sorted.bam > genome.alr

Or if you have several (sorted) bam files for different individuals you put their names in a single file. In my example all my files end with ".sorted.bam", so:

$ ls *.sorted.bam > bams

And then do

$ bam2alr -r genome.fas -b bams > genomes.alr

Now, from older versions to the new one, there are fewer options because I think they were trying to do the program simpler to use (and maybe less intimidating), while keeping the parameters that they normally use for PopPhyl as defaults.

So finally, to run it I do

$ reads2snp -alr genomes.alr -par 1 -opt newton -min 20 -th2 0.01 -tol -fis 0 -pre 0.001 th1 0.95 >> genomes.output

Notice that actually most of the parameters that I use are the defaults (since I'm following PopPhyl specifications) but I would like to make some comments. Either way you could just run it like this (it's the same):

$ reads2snp -alr genomes.alr -min 20 >> genomes.output

So the parameters:
[-par 0|1|2] In previous versions, -par didn't have options. These new version has three, tho, 0:no paraclean 1:stringent 2:fast. We definitely want Paraclean (the program involved in finding the hidden paralogs), and it seems that using option 2 is not worthy (I suspect they even took the option out).

[-opt bfgs|newton] The ML optimization method used by Paraclean. I think I tried at some point with the bfgs option in older versions and it didn't work for me. Probably now it works, but newton is the default for a reason.

[-min integer] The minimal number of reads to call a genotype. This is the only parameter that I don't keep as default (10). I guess I like to have some more confidence on my genotypes ;).

[-th2 float] Paraclean LRT p-val threshold use to be -thr in older versions.

[-acc|-tol|-for] multi alleles: accepted|tolerated|forbidden

[-fis positive float] assumed FIS value, the inbreeding coefficient of an individual (I) relative to the subpopulation (S). Notice that this assumes diploid individuals and Hardy-Weinberg equilibrium.

[-pre] precision on parameters; default=0.001

[-th1 float] the genotype posterior probability threshold (default=0.95) use to be -gpv in older versions.

* This program is NOT the same as discoSnp, which use to have a similar name (read2SNPs)

Saturday, October 4, 2014

Going to the beach in Montpellier

Today I went to one of the beaches near Montpellier. Just so I don't forget how to, I want just to describe how to get there.

These are instructions to get to Grand Travers beach on a Saturday. The bus times change depending on the day. Check this website for the time tables.

First, I took the Tram line 1 (the blue one) towards Odysseum. I got down at the station before Odysseum: Place de France.

There, I looked for a bus stop (there are at least two, the one towards Odysseum but on the other side of the tram line). The bus that I was looking for is number 106, and the ticket is 1.60€ one way. I took it at 11:30am. The normal transport card from Montpellier doesn't work for this.

Then I just waited for the stop that I was interested in (you have to hit Stop or the bus won't stop!). I think the bus can go all the way to La Grande Motte, which is supposed to be a nice beach, but I went down before that in the Grand Travers.

According to this website, Grand Travers is not so great, but I arrived in the morning (around 11:10am) in a cloudy day and it was pretty good. A couple of families here and there. The day got super good later, and then a lot of people appeared, but still, it was nice. The particular spot next to the bus stop is not close to any building, which can be good or bad depending on your needs. I think there are some restaurant somewhere, but I didn't look for them. There was one boy selling ice cream at some point. I would anyway recommend to bring your own food. That's what a lot of families and myself did.

I returned early, taking the bus at 16:22. But there are a few other buses later. Be careful because the buses don't run very often.

Saturday, September 6, 2014

War in the world of lichens

The quote of the day:

"(...) in fact there is war in the world of lichens. Following a brief consideration of parasitism and other forms of symbiosis, this war will be illustrated under the headings of algal slaves, exploiters of two kingdoms, alien invaders, cosy niche seekers and take-over specialists."

Richardson, D.H.S. (1999) War in the world of lichens: parasitism and symbiosis as exemplified by lichens and lichenicolous fungi, Mycol. Res. 103(6): 641-650.

Thursday, August 28, 2014

Careless lichenology

Today I was reading about taxonomy and lichens. I found something that I think is very fitting:

"(...) as stated by Santesson (1954), 'in no other group of plants is the taxonomic work carried out more carelessly than in lichenology'."

Honegger R (1983) The Ascus Apex in Lichenized Fungi IV. Baeomyces and Icmadophila in Comparison with Cladonia (Lecanorales) and the Non-Lichenized Leotia (Helotiales), Lichenologist 14: 205-217

I think the modern lichenologist are no longer like that, tho.

Sunday, May 18, 2014

Restricting a BLAST search to particular taxa or type of sequences

So the other day we got worried that we may have some contamination in our raw data of the fungal genome that we are trying to assembly. In theory nothing should be there but the fungus... but what if there are, say, some bacteria sequences that shouldn't be there x_x!??

We decided to do a BLAST search with some unitigs* that were assembled by the software ABySS (Simpson et al. 2009), restricting the hits to Bacteria sequences. To accomplish this, I used BLAST+ which surprisingly has a rather dense and not-easy-for-the-uninitiated documentation. Here are some starters:

http://www.bioinformaticsbox.com/blast-blastall-tutorial/
ftp://ftp.ncbi.nih.gov/blast/documents/formatdb.html
http://www.bioperl.org/wiki/HOWTO:PhylogeneticAnalysisPipeline#All_vs_All
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node1.html

For the current task, I ended up using the flag "-l" (lower case L) that "Restricts the search of a database to the subset specified by GIs in the input file" according to:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node53.html

So you run it like this:

blastall -p blastn -d nt -i query.fasta -o output_blast.txt -a 4 -F 'm D' -m 8 -e 1e-5 -l restriction_file.txt

where:
nt -- is the NCBI database that is already in the cluster that I'm using
-a -- are the number of processors used
-F -- Specifies filter(s) to be used to mask query sequence**. In this case, 'm S' for proteins, 'm D' for nucleotides.
-m -- The output format, in this case it's given in Tabular file (8)
-e -- Specifies Expectation value cutoff
-l -- Restricts the search of a database to the subset specified by GIs in the input file

To get my GI file I just went to Genbank (http://www.ncbi.nlm.nih.gov/genbank/), I typed "Bacteria" in the search box with "Nucleotide" selected. It gave a ton of sequences..from which the first one is an Arabidopsis (wtf?). However, in the right section of "Filter your results" there is a possibility of clicking "Bacteria (11277160)" which is what we wanted in the first place. I clicked on it, and then I clicked on "Send to" --> "File" --> Format "GI List" --> "Create file" and voilà!

The file looks like this:

528171613
528816716
374352002
378958130
238801497
.
.
.

The logic should apply all the same if you want particular genes instead of species. Hope this helps somebody :)

* unitig is a contig formed from overlapping unambiguously unique sequences, see also http://helicoverpa.org/glossary/term/2
** uh?... check http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node80.html

Simpson et al. 2009 ABySS: A parallel assembler for short read sequence data, Genome Research 19(6):1117-23

Sunday, April 6, 2014

Upgrading to R 3.0.3 in my Mac OS X 10.7.5 Lion

Ok, so as my first post I would like to comment a bit about upgrading R in my MacBook Pro. I did this in order to complete the assignment of Week 4 for the course of Bioinformatic Methods II in Coursera (www.coursera.org).

I found a nice post about this but for Leopard in here:
http://onertipaday.blogspot.se/2008/10/r-upgrade-on-mac-os-x-1055-leopard.html

Along with
http://r.research.att.com/man/RMacOSX-FAQ.html#How-can-R-for-Mac-OS-X-be-obtained-and-installed_003f

To reinstall packages from an old version of R to a new one, go into the terminal:

$ R
> tmp <- installed.packages()
> installedpkgs <- as.vector(tmp[is.na(tmp[,"Priority"]), 1])
> save(installedpkgs, file="installed_old.rda")
> quit()

I downloaded the version 3.0.3 of R from http://cran.r-project.org/bin/macosx/
To erase the old R version:

$ sudo rm -rf /Library/Frameworks/R.framework /Applications/R.app /Applications/R64.app
$ sudo rm -f /usr/bin/R /usr/bin/Rscript /usr/bin/R32 /usr/bin/R64

I installed the new version just with double-clicking to the downloaded package. From here I just followed all the commands of Paolo in the first link provided above.