Reconstructing ancestral genomic sequences by co-evolution: formal definitions, computational issues, and biological examples.

Author information

Tuller Tamir, Birin Hadas, Kupiec Martin, Ruppin Eytan

Affiliation

Faculty of Mathematics and Computer Science, Weizmann Institute of Science, Rehovot, Israel.

Publication information

J Comput Biol. 2010 Sep;17(9):1327-44. doi: 10.1089/cmb.2010.0112.

DOI:10.1089/cmb.2010.0112

PMID:20874411

Abstract

The inference of ancestral genomes is a fundamental problem in molecular evolution. Due to the statistical nature of this problem, the most likely or the most parsimonious ancestral genomes usually include considerable error rates. In general, these errors cannot be eliminated by utilizing more exhaustive computational approaches, by using longer genomic sequences, or by analyzing more taxa. In recent studies, we showed that co-evolution is an important force that can be used to significantly improve the inference of ancestral genome content. In this work we formally define a computational problem for the inference of ancestral genome content by co-evolution. We show that this problem is NP-hard and hard to approximate, and we present both a Fixed Parameter Tractable (FPT) algorithm and heuristic approximation algorithms for solving it. On simulated inputs with hundreds of protein families and hundreds of co-evolutionary relations, these algorithms ran quickly (up to four minutes) and achieved an approximation ratio below 1.3. We use our approach to study the ancestral genome content of the Fungi. To this end, we applied our approach to a dataset of 33,931 protein families and 20,317 co-evolutionary relations. Our algorithm added and removed hundreds of proteins from the ancestral genomes inferred by maximum likelihood (ML) or maximum parsimony (MP) while only slightly affecting the likelihood/parsimony score of the results. A biological analysis revealed various pieces of evidence that support the biological plausibility of the new solutions. In addition, we showed that our approach reconstructs missing values at the leaves of the Fungi evolutionary tree better than ML or MP.
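
For readers unfamiliar with the maximum parsimony (MP) baseline mentioned in the abstract, the following is a minimal sketch of the standard Fitch algorithm applied to the presence (1) or absence (0) of a single protein family on a rooted binary tree. It is not the authors' co-evolution algorithm, which additionally takes co-evolutionary relations between families into account. The tree, the node names (`root`, `n1`, `sp1`, and so on), the leaf states, and the tie-breaking rule are all hypothetical and chosen only for illustration.

```python
def fitch(children, root, leaf_states):
    """Fitch parsimony for one binary character (protein family present/absent).

    children    : dict mapping each internal node to its (left, right) children
    leaf_states : dict mapping each leaf to 0 (absent) or 1 (present)
    Returns (states, changes): one most parsimonious state per node and the
    minimum number of gain/loss events on the tree.
    """
    sets = {}      # candidate state set per node (bottom-up pass)
    changes = 0    # parsimony score

    def up(node):
        nonlocal changes
        if node not in children:               # leaf: state is observed
            sets[node] = {leaf_states[node]}
            return
        left, right = children[node]
        up(left)
        up(right)
        common = sets[left] & sets[right]
        if common:
            sets[node] = common
        else:                                  # children disagree: one event
            sets[node] = sets[left] | sets[right]
            changes += 1

    states = {}

    def down(node, parent_state):
        # Keep the parent's state when allowed; otherwise pick the smallest
        # candidate (an arbitrary tie-break assumed for this sketch).
        if parent_state in sets[node]:
            states[node] = parent_state
        else:
            states[node] = min(sets[node])
        for child in children.get(node, ()):
            down(child, states[node])

    up(root)
    down(root, None)
    return states, changes


# Hypothetical four-species tree; the family is missing only in sp3.
children = {"root": ("n1", "n2"), "n1": ("sp1", "sp2"), "n2": ("sp3", "sp4")}
leaf_states = {"sp1": 1, "sp2": 1, "sp3": 0, "sp4": 1}

states, changes = fitch(children, "root", leaf_states)
print(states["root"], changes)
```

In this toy input the family is inferred as present at every ancestral node, with a single loss on the branch leading to `sp3`. Treating each family independently in this way is the kind of reconstruction that, according to the abstract, co-evolutionary information is used to refine.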
