The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, D-68159 Heidelberg, Germany, Graduate School for Computing in Medicine and Life Sciences, University of Lübeck, Institut für Neuro- und Bioinformatik, University of Lübeck, 23538 Lübeck, Germany, Natural History Museum of Crete, University of Crete, GR-71409 Irakleio, Crete, Greece and Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas-FORTH, GR-70013 Heraklion, Crete, Greece.
Bioinformatics. 2013 Nov 15;29(22):2869-76. doi: 10.1093/bioinformatics/btt499. Epub 2013 Aug 29.
Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys and DNA metabarcoding studies. Current approaches rely either on simple sequence similarity thresholds (OTU-picking) or on complex and computationally intensive evolutionary models. OTU-picking methods scale well to large datasets, but their results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and Markov Chain Monte Carlo sampling, and can therefore only be applied to small datasets.
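The threshold sensitivity of OTU-picking can be illustrated with a minimal sketch. It assumes a simple greedy, centroid-based clustering on pre-aligned toy sequences; the function names `identity` and `greedy_otus` and the toy data are hypothetical and do not correspond to any specific OTU-picking tool compared in this work.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs: list[str], threshold: float) -> list[list[str]]:
    """Greedy clustering: each sequence joins the first OTU whose centroid
    (the OTU's first member) it matches at >= threshold, else it founds a new OTU."""
    otus: list[list[str]] = []
    for s in seqs:
        for otu in otus:
            if identity(s, otu[0]) >= threshold:
                otu.append(s)
                break
        else:
            # No existing centroid was similar enough: start a new OTU.
            otus.append([s])
    return otus


# Hypothetical toy alignment (10 sites per sequence).
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGAAT", "TTGTACGAAT"]

# The number of OTUs recovered from the same data changes with the threshold.
for t in (0.70, 0.90, 0.97):
    print(t, len(greedy_otus(toy, t)))
```

On this toy input, each stricter threshold splits the same four sequences into more OTUs, which is the kind of dependence on a user-chosen cutoff that the abstract describes.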
We introduce the Poisson tree processes (PTP) model to infer putative species boundaries on a given phylogenetic input tree. We also integrate PTP with our evolutionary placement algorithm (EPA-PTP) to count the number of species in phylogenetic placements. We compare our approaches with popular OTU-picking methods and the General Mixed Yule Coalescent (GMYC) model. For de novo species delimitation, the stand-alone PTP model generally outperforms GMYC as well as OTU-picking methods when evolutionary distances between species are small. PTP requires neither an ultrametric input tree nor a sequence similarity threshold as input. In the open reference species delimitation approach, EPA-PTP yields more accurate results than de novo species delimitation methods. Finally, EPA-PTP scales to large datasets because it relies on the parallel implementations of the EPA and RAxML, which makes it possible to delimit species in high-throughput sequencing data.
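The idea of scoring a candidate delimitation directly on branch lengths can be sketched as follows. The abstract does not state the likelihood, so this sketch assumes that branch lengths (in expected substitutions per site) fall into two classes, between-species and within-species, each modelled by its own exponential distribution with a maximum-likelihood rate. The function names and the toy branch lengths are hypothetical, and the search over candidate delimitations on the tree is omitted.

```python
import math


def exp_loglik(lengths: list[float]) -> float:
    """Maximised log-likelihood of branch lengths under one exponential distribution.

    The ML rate is n / sum(lengths); plugging it in gives n*log(rate) - n.
    """
    n, total = len(lengths), sum(lengths)
    if n == 0:
        return 0.0
    if total <= 0:
        raise ValueError("branch lengths must sum to a positive value")
    rate = n / total
    return n * math.log(rate) - rate * total


def two_class_loglik(between: list[float], within: list[float]) -> float:
    """Score a candidate delimitation: separate exponential rates for
    between-species and within-species branches (assumed model)."""
    return exp_loglik(between) + exp_loglik(within)


# Hypothetical branch lengths: long branches between species, short ones within.
between = [0.10, 0.12, 0.09]
within = [0.001, 0.002, 0.001, 0.003]

good = two_class_loglik(between, within)
# A worse delimitation that misassigns one long and one short branch.
bad = two_class_loglik([0.10, 0.12, 0.001], [0.09, 0.002, 0.001, 0.003])
# Null model: every branch drawn from a single exponential class.
null = exp_loglik(between + within)

print(good > bad, good > null)
```

Because the score depends only on branch lengths in substitutions, a sketch of this kind needs neither an ultrametric tree nor a similarity threshold, consistent with the properties stated for PTP.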
The code is freely available at www.exelixis-lab.org/software.html.