Scaling up accurate phylogenetic reconstruction from gene-order data.

Author information

Tang Jijun, Moret Bernard M E

Affiliation

Department of Computer Science, University of New Mexico, Albuquerque, NM 87131, USA.

Publication information

Bioinformatics. 2003;19 Suppl 1:i305-12. doi: 10.1093/bioinformatics/btg1042.

DOI:10.1093/bioinformatics/btg1042

PMID:12855474

Abstract

MOTIVATION

Phylogenetic reconstruction from gene-order data has attracted increasing attention from both biologists and computer scientists over the last few years. Methods used in reconstruction include distance-based methods (such as neighbor-joining), parsimony methods using sequence-based encodings, Bayesian approaches, and direct optimization. Direct optimization, pioneered by Sankoff and extended by us with the software suite GRAPPA, is the most accurate approach, but cannot handle more than about 15 genomes of limited size (e.g. organelles).

RESULTS

We report here on our successful efforts to scale up direct optimization through a two-step approach: the first step decomposes the dataset into smaller pieces and runs the direct optimization (GRAPPA) on the smaller pieces, while the second step builds a tree from the results obtained on the smaller pieces. We used the sophisticated disk-covering method (DCM) pioneered by Warnow and her group, suitably modified to take into account the computational limitations of GRAPPA. We find that DCM-GRAPPA scales gracefully to at least 1000 genomes of a few hundred genes each and retains surprisingly high accuracy throughout the range: in our experiments, the topological error rate rarely exceeded a few percent. Thus, reconstruction based on gene-order data can now be accomplished with high accuracy on datasets of significant size.
