Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden.
Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Pakistan.
Mol Biol Evol. 2019 Oct 1;36(10):2340-2351. doi: 10.1093/molbev/msz142.
Multiple sequence alignment (MSA) is ubiquitous in evolution and bioinformatics. MSAs are usually taken to be a known and fixed quantity on which to perform downstream analysis, despite extensive evidence that MSA accuracy and uncertainty affect results. Alignment errors are known to cause a wide range of problems for downstream evolutionary inference, ranging from false inference of positive selection to long branch attraction artifacts. The most popular approach to dealing with this problem is to remove (filter) specific columns in the MSA that are thought to be prone to error. Although popular, this approach has had mixed success, and several studies have even suggested that filtering might be detrimental to phylogenetic studies. We present a graph-based clustering method to address MSA uncertainty and error in the software Divvier (available at https://github.com/simonwhelan/Divvier), which uses a probabilistic model to identify clusters of characters that have strong statistical evidence of shared homology. These clusters can then be used either to filter characters from the MSA (partial filtering) or to represent each of the clusters in a new column (divvying). We validate Divvier through its performance on real and simulated benchmarks, finding that Divvier substantially outperforms existing filtering software by retaining more true pairwise homology calls and removing more false positive pairwise homologies. We also find that Divvier, in contrast to other filtering tools, can alleviate long branch attraction artifacts induced by MSA and reduce the variation in tree estimates caused by MSA uncertainty.
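The difference between partial filtering and divvying can be illustrated on a single alignment column. The following is a minimal sketch and not Divvier's implementation. It assumes a hypothetical dictionary `pair_prob` of pairwise homology probabilities and a hypothetical `threshold`, and it substitutes simple connected-component clustering (union-find) for the probabilistic clustering the software actually performs. Treating partial filtering as "keep only the largest cluster" is likewise an assumption made for illustration.

```python
from itertools import combinations


def cluster_column(chars, pair_prob, threshold=0.8):
    """Group the row indices of one column into clusters of putative homologs.

    chars      -- the column as a string, one character per sequence ('-' = gap)
    pair_prob  -- hypothetical dict {(i, j): probability}, i < j, that the
                  characters in rows i and j are homologous
    threshold  -- hypothetical cut-off for "strong evidence"
    """
    rows = [i for i, c in enumerate(chars) if c != "-"]
    parent = {i: i for i in rows}

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join two rows whenever their pairwise support reaches the threshold.
    for i, j in combinations(rows, 2):
        if pair_prob.get((i, j), 0.0) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in rows:
        clusters.setdefault(find(i), []).append(i)
    # Largest cluster first; ties broken by row index for determinism.
    return sorted(clusters.values(), key=lambda c: (-len(c), c))


def divvy(chars, clusters):
    """Divvying: one new column per cluster, gaps everywhere else."""
    return [
        "".join(c if i in cl else "-" for i, c in enumerate(chars))
        for cl in clusters
    ]


def partial_filter(chars, clusters):
    """Partial filtering (assumed here): keep the largest cluster only."""
    keep = set(clusters[0]) if clusters else set()
    return "".join(c if i in keep else "-" for i, c in enumerate(chars))


# Toy column over five sequences; the probabilities are made up.
column = "AAGG-"
probs = {(0, 1): 0.95, (2, 3): 0.90, (0, 2): 0.10, (1, 3): 0.05}
clusters = cluster_column(column, probs)
divvied = divvy(column, clusters)
filtered = partial_filter(column, clusters)
```

In this toy case the column splits into two clusters, so divvying yields two columns (`AA---` and `--GG-`) and keeps every character, whereas partial filtering yields the single column `AA---` and discards the two characters outside the largest cluster.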