Stadler Peter F, Geiß Manuela, Schaller David, López Sánchez Alitzel, González Laffitte Marcos, Valdivia Dulce I, Hellmuth Marc, Hernández Rosales Maribel
1Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany.
2Competence Center for Scalable Data Services and Solutions Dresden/Leipzig, Interdisciplinary Center for Bioinformatics, German Centre for Integrative Biodiversity Research (iDiv), and Leipzig Research Center for Civilization Diseases, Universität Leipzig, Augustusplatz 12, 04107 Leipzig, Germany.
Algorithms Mol Biol. 2020 Apr 9;15:5. doi: 10.1186/s13015-020-00165-2. eCollection 2020.
Many commonly used methods for orthology detection start from mutually most similar pairs of genes (reciprocal best hits) as an approximation of the evolutionarily most closely related pairs of genes (reciprocal best matches). This approximation of best matches by best hits becomes exact for ultrametric dissimilarities, i.e., under the Molecular Clock Hypothesis. It fails, however, whenever there are large lineage-specific rate variations among paralogous genes. In practice, this introduces a high level of noise into the input data for best-hit-based orthology detection methods.
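To make the notion of reciprocal best hits concrete, the following is a minimal sketch in Python. The function names `best_hits` and `reciprocal_best_hits`, the gene labels, and the toy dissimilarities are illustrative assumptions and are not taken from the accompanying software. In the toy data, `b1` is assumed to be the true closest relative of `a1` but has evolved much faster than its paralog `b2`, so the best-hit criterion pairs `a1` with `b2` instead.

```python
from collections import defaultdict

def best_hits(dist, species):
    """For each gene x, collect the genes of every other species
    that have minimal dissimilarity to x (hypothetical helper)."""
    hits = defaultdict(set)
    for x in species:
        # Group all genes from other species by their species label.
        by_species = defaultdict(list)
        for y in species:
            if species[y] != species[x]:
                by_species[species[y]].append(y)
        # Keep the minimizers within each foreign species.
        for genes in by_species.values():
            m = min(dist[x][y] for y in genes)
            hits[x].update(y for y in genes if dist[x][y] == m)
    return hits

def reciprocal_best_hits(dist, species):
    """Pairs {x, y} in which each gene is a best hit of the other."""
    hits = best_hits(dist, species)
    return {frozenset((x, y)) for x in hits for y in hits[x] if x in hits[y]}

# Toy data (assumed for illustration): gene a1 in species A,
# paralogs b1 and b2 in species B; b1 sits on a fast-evolving lineage.
species = {"a1": "A", "b1": "B", "b2": "B"}
dist = {
    "a1": {"b1": 6.0, "b2": 2.5},
    "b1": {"a1": 6.0, "b2": 6.5},
    "b2": {"a1": 2.5, "b1": 6.5},
}

rbh = reciprocal_best_hits(dist, species)
print(rbh)  # the only reciprocal best hit is the pair {a1, b2}
```

Under a molecular clock the same procedure would recover the evolutionarily closest pair; here the rate asymmetry alone is enough to mislead it.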
If additive distances between genes are known, then the evolutionarily most closely related pairs can be identified by considering certain quartets of genes, provided that in each quartet the outgroup relative to the remaining three genes is known. Knowledge of the underlying species phylogeny greatly facilitates the identification of the required outgroup. Although the workflow remains a heuristic, since the correct outgroup cannot be determined reliably in all cases, simulations with lineage-specific biases and rate asymmetries show that nearly perfect results can be achieved. In a realistic setting, where distance data have to be estimated from sequence data and hence are noisy, it is still possible to obtain highly accurate sets of best matches.
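The quartet idea can be sketched as follows, assuming exactly additive distances and an outgroup `z` that is already known. For a query gene `x` and two candidates `y1`, `y2`, the four-point condition says that the smallest of the three pairwise sums identifies the split of the quartet tree; with the root on the branch leading to `z`, the candidate grouped with `x` is the closer relative. The function name `closer_relative`, the tolerance parameter, and the toy distances are assumptions made for illustration; this is not the implementation used in AsymmeTree.

```python
def closer_relative(d, x, y1, y2, z, tol=1e-9):
    """Return the candidate (y1 or y2) that is more closely related to x,
    or None if the quartet does not separate them.
    Assumes additive distances d and that z is an outgroup to x, y1, y2."""
    s_y1 = d[x][y1] + d[y2][z]   # sum for the split  x y1 | y2 z
    s_y2 = d[x][y2] + d[y1][z]   # sum for the split  x y2 | y1 z
    s_tie = d[x][z] + d[y1][y2]  # sum for the split  x z  | y1 y2
    # Four-point condition: the strictly smallest sum marks the true split.
    if s_y1 < min(s_y2, s_tie) - tol:
        return y1
    if s_y2 < min(s_y1, s_tie) - tol:
        return y2
    return None  # y1 and y2 are equally related to x (or the tree is a star)

# Toy additive distances (assumed) read off the rooted tree
# (((x:1, y1:5):1, y2:0.5):1, z:1); y1 evolves much faster than y2.
pairs = {
    ("x", "y1"): 6.0, ("x", "y2"): 2.5, ("y1", "y2"): 6.5,
    ("x", "z"): 4.0, ("y1", "z"): 8.0, ("y2", "z"): 2.5,
}
d = {}
for (a, b), value in pairs.items():
    d.setdefault(a, {})[b] = value
    d.setdefault(b, {})[a] = value

best_hit = min(("y1", "y2"), key=lambda y: d["x"][y])
best_match = closer_relative(d, "x", "y1", "y2", "z")
print(best_hit, best_match)  # prints: y2 y1
```

In this example the plain distance comparison picks `y2`, while the quartet test recovers `y1`, the gene sharing the most recent common ancestor with `x`. With noisy estimated distances the three sums will not satisfy the four-point condition exactly, so the tolerance and the choice of outgroup both matter in practice.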
Improvements of tree-free orthology assessment methods can be expected from a combination of the accurate inference of best matches reported here and recent mathematical advances in the understanding of (reciprocal) best match graphs and orthology relations.
Accompanying software is available at https://github.com/david-schaller/AsymmeTree.