Wednesday, July 29, 2015

When we fail MrBayes (my version)


In the last few weeks I’ve been extensively working with basic molecular phylogenetics again. My previous experience with this topic has been mostly positive, with reasonably well behaved datasets (I think). However, thanks to the reviewers of the paper that I’m trying to publish, I came to realize that we, the community of random mortal little researchers working with non-model and typically diverse organisms, are doing things rather wrong. 

I feel that phylogenetics (like many other topics in science nowadays) has become so complex in theoretical terms, but at the same time so accessible through easy-peasy PCR and friendly programs like RAxML, that end-users are just blindly applying default values and getting quick (and dirty) phylogenies for their favorite organisms. Maybe I'm exaggerating, and people for the most part know what they are doing, but I don't think this is reflected in the papers published.

An old post (the original "When we fail MrBayes, Part II") from 2009 written by Dr. Dan Rabosky (whom I had the pleasure to personally meet in a course he was giving, although I was just one among 30 other grad students) truly opened my eyes. I think the blog in question is no longer working, which is a terrible shame, since one can quickly see that the people talking there are “elite” in phylogenetic methods. Anyway, the point for me with that particular post was a central comment:

“On the whole, the results of this survey do not encourage me. Is our research community doing enough to diagnose convergence failure in MCMC analyses? How severe is this problem? Maybe I’m making a mountain out of a molehill here based on my own experience with a few poorly-behaved datasets. But looking at the literature, it is hard to convince myself that most studies are adequately diagnosing convergence problems, and I can’t help but feel a bit unsettled by all of this.”

Frankly, I don’t think this issue has improved now, in 2015. I see the papers that I cite about corals or fungi, and really...we say very little. Ironically (or maybe not), the papers that are most explicit about their actual practices are in very low impact journals that are relevant for the very few people that care about these groups (e.g. The Lichenologist). I think this is related to the fact that in the NGS era you cannot publish a plain old phylogeny anymore. It's as if in our frenzy of spending tons of money on billions of reads, we've forgotten that 1) you still need to collect very difficult samples of obscure species and field work is still expensive, and 2) sometimes the best thing you can get is a bloody nuclear ribosomal marker. So... yeah.

In his post, Rabosky argues that we should for sure be using AWTY ("Are We There Yet?"... I know, it's awesome), a “system for graphical exploration of Markov chain Monte Carlo (MCMC) convergence in Bayesian phylogenetic inference”. Unlike Tracer, AWTY actually looks at convergence in topology, the one thing that most of us care about!! On my own I tried to understand AWTY’s output based on the original paper and Rabosky’s comments in that post. At this point I think I kinda get it, but it’s just me being self-taught.

In any case, my current coral dataset is not well behaved and that gives me headaches. I did the relevant analyses for the paper back in 2013, and truth is, the MCMC chains were a disaster:



Funnily enough, no matter what I did, the topology was always the same (and the same one as the RAxML tree)… Years later I came back to the data set, I ran MrBayes with the same data block, and PUM! The horrible jumps were gone (and the topology was still the same)! It took me forever to understand what the difference was, silly me.

As it turns out, back in 2013 I was using MrBayes v. 3.2.1 on an old iMac. Now I’ve been using the new MrBayes 3.2.4 x64 (on a cluster). Maybe it’s just me, but among the million things that the MrBayes authors might be working on, the one thing that was different and relevant for my study was the prior for the branch lengths! The new prior, described by Rannala et al. (2012), seems to be definitely better than the old exponential distribution with a rate of 10.0 (and therefore an expected mean branch length of 0.1). I look at my current tree and clearly the branches are nowhere near 0.1; they are much shorter. I suspect that was messing up the estimates of the model of molecular evolution...
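If you'd rather not find this out by accident like I did, you can set the branch-length prior explicitly in your MrBayes block instead of trusting whatever the version's default is. A sketch of what I mean (double-check the exact syntax against the manual for your version; the gammadir parameter values below are the newer default as I understand it):

```
begin mrbayes;
  [old-style default: i.i.d. exponential branch lengths,
   rate 10.0, i.e. expected mean branch length of 0.1]
  prset brlenspr = unconstrained:exp(10.0);

  [or, the compound Dirichlet prior of Rannala et al. (2012),
   the default in recent versions]
  prset brlenspr = unconstrained:gammadir(1.0, 0.1, 1.0, 1.0);
end;
```

Either way, the choice then lives in your data block (and your methods section) rather than buried in whichever binary you happened to run.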

In any case, the real lessons here are:
- No, you cannot just go ahead and use the default temperature and default priors so easily.
- We should reaaaally be reporting things like ESS, PSRF, and average standard deviation of split frequencies, along with Tracer and AWTY plots.
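And none of those numbers are mysterious, by the way. As a toy illustration (not the exact estimator Tracer uses; real implementations truncate the autocorrelation sum more carefully), here is roughly what an ESS estimate boils down to: the raw number of samples deflated by how autocorrelated the chain is.

```python
import numpy as np

def effective_sample_size(trace):
    """Crude ESS estimate: N / (1 + 2 * sum of leading positive
    autocorrelations). Tracer/coda use more careful truncation rules."""
    x = np.asarray(trace, dtype=float)
    n = len(x)
    x = x - x.mean()
    # autocorrelation at lags 0..n-1, normalized so acf[0] == 1
    acf = np.correlate(x, x, mode="full")[n - 1:] / np.dot(x, x)
    # accumulate the integrated autocorrelation time until the
    # first non-positive autocorrelation
    tau = 1.0
    for k in range(1, n):
        if acf[k] <= 0:
            break
        tau += 2.0 * acf[k]
    return n / tau

rng = np.random.default_rng(42)

# independent draws: ESS is near the nominal sample size
iid = rng.normal(size=10000)
print(effective_sample_size(iid))

# a sticky AR(1) "chain" (strong autocorrelation): ESS collapses,
# just like a poorly mixing MCMC run
ar = np.empty(10000)
ar[0] = 0.0
for i in range(1, 10000):
    ar[i] = 0.95 * ar[i - 1] + rng.normal()
print(effective_sample_size(ar))
```

The second trace has the same 10,000 samples as the first but carries far less independent information, which is exactly the situation a low ESS in Tracer is warning you about.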

There is more to life than the model chosen by JModelTest, people!

Anyway, check out Rannala et al.’s paper. Even if you are not really familiar with the actual mathematics (I’m not, for one thing), you will learn a thing or two.

Another terrific paper that I recently encountered is Marshall et al. (2006). It explicitly talks about working with MrBayes, which I loved!


Rannala, B., Zhu, T., and Yang, Z. (2012). Tail paradox, partial identifiability, and influential priors in Bayesian branch length inference. Molecular Biology and Evolution, 29(1):325-335.

Marshall, D. C., Simon, C., and Buckley, T. R. (2006). Accurate branch length estimation in partitioned Bayesian analyses requires accommodation of among-partition rate variation and attention to branch length priors. Systematic Biology, 55(6):993-1003.
