Suppr超能文献

使用最小二乘准则和算法的快速匹配

Fast Dating Using Least-Squares Criteria and Algorithms.

作者信息

To Thu-Hien, Jung Matthieu, Lycett Samantha, Gascuel Olivier

机构信息

Institut de Biologie Computationnelle, LIRMM, UMR 5506 CNRS - Université de Montpellier, France;

Institut de Biologie Computationnelle, LIRMM, UMR 5506 CNRS - Université de Montpellier, France; IGBMC (Institut de Génétique et de Biologie Moléculaire et Cellulaire), INSERM, U596, CNRS, UMR7104, Université de Strasbourg, Illkirch, France;

出版信息

Syst Biol. 2016 Jan;65(1):82-97. doi: 10.1093/sysbio/syv068. Epub 2015 Sep 30.

Abstract

Phylogenies provide a useful way to understand the evolutionary history of genetic samples, and data sets with more than a thousand taxa are becoming increasingly common, notably with viruses (e.g., human immunodeficiency virus (HIV)). Dating ancestral events is one of the first, essential goals with such data. However, current sophisticated probabilistic approaches struggle to handle data sets of this size. Here, we present very fast dating algorithms, based on a Gaussian model closely related to the Langley-Fitch molecular-clock model. We show that this model is robust to uncorrelated violations of the molecular clock. Our algorithms apply to serial data, where the tips of the tree have been sampled through times. They estimate the substitution rate and the dates of all ancestral nodes. When the input tree is unrooted, they can provide an estimate for the root position, thus representing a new, practical alternative to the standard rooting methods (e.g., midpoint). Our algorithms exploit the tree (recursive) structure of the problem at hand, and the close relationships between least-squares and linear algebra. We distinguish between an unconstrained setting and the case where the temporal precedence constraint (i.e., an ancestral node must be older that its daughter nodes) is accounted for. With rooted trees, the former is solved using linear algebra in linear computing time (i.e., proportional to the number of taxa), while the resolution of the latter, constrained setting, is based on an active-set method that runs in nearly linear time. With unrooted trees the computing time becomes (nearly) quadratic (i.e., proportional to the square of the number of taxa). In all cases, very large input trees (>10,000 taxa) can easily be processed and transformed into time-scaled trees. We compare these algorithms to standard methods (root-to-tip, r8s version of Langley-Fitch method, and BEAST). Using simulated data, we show that their estimation accuracy is similar to that of the most sophisticated methods, while their computing time is much faster. We apply these algorithms on a large data set comprising 1194 strains of Influenza virus from the pdm09 H1N1 Human pandemic. Again the results show that these algorithms provide a very fast alternative with results similar to those of other computer programs. These algorithms are implemented in the LSD software (least-squares dating), which can be downloaded from http://www.atgc-montpellier.fr/LSD/, along with all our data sets and detailed results. An Online Appendix, providing additional algorithm descriptions, tables, and figures can be found in the Supplementary Material available on Dryad at http://dx.doi.org/10.5061/dryad.968t3.

摘要

系统发育树为理解基因样本的进化历史提供了一种有用的方法,包含超过一千个分类单元的数据集正变得越来越普遍,尤其是在病毒领域(例如,人类免疫缺陷病毒(HIV))。确定祖先事件的时间是处理这类数据的首要基本目标之一。然而,当前复杂的概率方法在处理这种规模的数据集时面临困难。在此,我们提出了基于与兰利 - 菲奇分子钟模型密切相关的高斯模型的非常快速的定年算法。我们表明该模型对于分子钟的不相关违反具有鲁棒性。我们的算法适用于序列数据,即树的末端是在不同时间进行采样的。它们可以估计替换率以及所有祖先节点的时间。当输入树是无根树时,它们可以为根的位置提供估计,从而代表了一种新的、实用的标准生根方法(例如,中点法)替代方案。我们的算法利用了手头问题的树(递归)结构以及最小二乘法和线性代数之间的紧密关系。我们区分了无约束设置和考虑时间先后约束(即祖先节点必须比其后代节点更古老)的情况。对于有根树,前者在线性计算时间(即与分类单元数量成比例)内使用线性代数求解,而后者的约束设置的求解基于一种在近乎线性时间内运行的活动集方法。对于无根树,计算时间变为(近乎)二次方(即与分类单元数量的平方成比例)。在所有情况下,非常大的输入树(>10,000个分类单元)都可以轻松处理并转换为时间尺度树。我们将这些算法与标准方法(根到末端法、兰利 - 菲奇方法的r8s版本以及BEAST)进行比较。使用模拟数据,我们表明它们的估计精度与最复杂的方法相似,而计算时间要快得多。我们将这些算法应用于一个包含来自pdm09 H1N1人类大流行的1194株流感病毒的大型数据集。结果再次表明,这些算法提供了一种非常快速的替代方法,其结果与其他计算机程序的结果相似。这些算法在LSD软件(最小二乘定年)中实现,该软件可从http://www.atgc - montpellier.fr/LSD/下载,同时还有我们所有的数据集和详细结果。在http://dx.doi.org/10.5061/dryad.968t3的Dryad上提供的补充材料中可以找到一个在线附录,其中提供了额外的算法描述、表格和图表。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5ff2/4678253/338a95d7d074/syv068f1.jpg

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验