Tuesday, October 7, 2014

How to use reads2snp with ALR files

Now that I'm back in Montpellier, I remembered that I wanted to write something about how to use an interesting program called reads2snp (Tsagkogeorga et al. 2012; Gayral et al. 2013, both open access :D)*.

"reads2snp is a SNP and genotype caller: it predicts the genotype of distinct individuals at distinct positions of a set of sequences based on read mapping / read counts."

The program was developed in the framework of the project PopPhyl, "in which transcriptome data are used to characterize the population genomics of non-model organisms." Unlike other programs, reads2snp makes no prior hypothesis regarding the population frequency of alleles, and it deals with biases in gene expression and hidden paralogy.

You can download it from PopPhyl's website here (at the time of writing, the available version was 2.0 march 2014). Once you unzip the file you already have the binaries for linux (there is also something for mac, but I won't be concerned with it, I guess it's similar). You can directly put the binary in your PATH, or use it as it is:

$ ./reads2snp_2.0.64.bin -v
reads2snp version = 2.0  march 2014

which prints the version. In the following I will assume the binary is in your PATH.

To see the options of the program, just type the name of the program, that I renamed to be just reads2snp:

$ reads2snp

Actually all you need to know is already in the README.md file that comes with the distribution. Unlike the older versions, as input you can give either BAM files or the so-called ALR files, which is a non-standard format explained at the end of the README.md. Because I need the ALR files in particular (the format has some information that I want), I used another program called bam2alr that you can get here.

To transform a single BAM file into an ALR (for just one individual)

$ bam2alr -r genome.fas sorted.bam > genome.alr

Or if you have several (sorted) bam files for different individuals you put their names in a single file. In my example all my files end with ".sorted.bam", so:
$ ls *.sorted.bam > bams
And then do

$ bam2alr -r genome.fas -b bams > genomes.alr

Now, from older versions to the new one, there are fewer options because I think they were trying to do the program simpler to use (and maybe less intimidating), while keeping the parameters that they normally use for PopPhyl as defaults.

So finally, to run it I do 

$ reads2snp -alr genomes.alr -par 1 -opt newton -min 20 -th2 0.01 -tol -fis 0 -pre 0.001 th1 0.95 >> genomes.output

Notice that actually most of the parameters that I use are the defaults (since I'm following PopPhyl specifications) but I would like to make some comments. Either way you could just run it like this (it's the same):

$ reads2snp -alr genomes.alr -min 20 >> genomes.output

So the parameters:
[-par 0|1|2] In previous versions, -par didn't have options. These new version has three, tho, 0:no paraclean     1:stringent     2:fast. We definitely want Paraclean (the program involved in finding the hidden paralogs), and it seems that using option 2 is not worthy (I suspect they even took the option out).

[-opt bfgs|newton]  The ML optimization method used by Paraclean. I think I tried at some point with the bfgs option in older versions and it didn't work for me. Probably now it works, but newton is the default for a reason.

[-min integer] The minimal number of reads to call a genotype. This is the only parameter that I don't keep as default (10). I guess I like to have some more confidence on my genotypes ;).

[-th2 float] Paraclean LRT p-val threshold use to be -thr in older versions.

[-acc|-tol|-for] multi alleles:  accepted|tolerated|forbidden

[-fis positive float]    assumed FIS value, the inbreeding coefficient of an individual (I) relative to the subpopulation (S). Notice that this assumes diploid individuals and Hardy-Weinberg equilibrium.

[-pre] precision on parameters; default=0.001 

[-th1 float]  the genotype posterior probability threshold (default=0.95) use to be -gpv in older versions.










* This program is NOT the same as discoSnp, which use to have a similar name (read2SNPs)

No comments:

Post a Comment