An independent base composition of each rate class for improved likelihood-based phylogeny estimation; the 5rf model.

Author information

Waddell Peter J, Bouckaert Remco

Affiliations

School of Natural Sciences, Massey University, Palmerston North, New Zealand.

Centre for Computational Evolution, University of Auckland, Auckland, New Zealand.

Publication information

bioRxiv. 2024 Sep 7:2024.09.03.610719. doi: 10.1101/2024.09.03.610719.

Abstract

The combination of a Time Reversible Markov process with a "hidden" mixture of Gamma distributed relative site rates plus Invariant sites has become the most favoured option for likelihood and other probabilistic models of nucleotide evolution (e.g., tr4gi, which approximates the gamma distribution with four rate classes). However, these models assume a homogeneous and stationary distribution of nucleotide (character or base) frequencies. Here, we explore the potential benefits and pitfalls of allowing each rate category (rate class) of a 4gi mixture model to have its own base frequencies. This is achieved by starting each of the five rate classes (four gamma classes plus the invariant class), at the tree's Root, with its own free choice of nucleotide Frequencies to create a 4gi5rf model, or a 5rf model in shorthand. We assess the practical identifiability of this approach with a BEAST 2 implementation, aiming to determine whether it can accurately estimate credibility intervals and expected values for a wide range of plausible parameter values. Practical identifiability, as distinguished from mathematical identifiability, gauges the model's ability to identify parameters in real-world scenarios, as opposed to theoretically with infinite data. One of the most common types of phylogenetic data is mitochondrial DNA (mtDNA) protein coding sequence. It is often assumed that current models analyse such data robustly and that models with higher likelihood or posterior probability do better. However, this study shows that vertebrate mtDNA remains a very difficult type of data to model fully, and that dramatically higher likelihoods do not mean a model is measurably more accurate at recovering key parameters of biological interest (e.g., monophyletic groups, their support and their ages). The 4gi5rf model considerably improves marginal likelihoods and seems to reverse some apparent errors exacerbated by the 4gi model, while introducing others.
Problems appear to be linked to non-stationary DNA repair processes that alter the mutation/substitution spectra across lineages and time. We also show that such problems are not unique to mtDNA and are encountered when analysing nuclear sequences. Non-stationarity of the mutation/substitution spectra shaped by DNA repair processes thus poses an active challenge to obtaining reliable inferences of relationships and divergence times near the root of placental mammals, for example. An open source implementation is available under the LGPL 3.0 license in the beastbooster package for BEAST 2, available from https://github.com/rbouckaert/beastbooster.
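
To make the idea of per-class root frequencies concrete, the following is a minimal sketch only, not the beastbooster implementation. It rests on several simplifying assumptions that are not from the abstract: the substitution process is JC69 rather than a general time-reversible model, the tree has just two tips joined at the root, and all numeric values (`RATES`, `WEIGHTS`, `ROOT_FREQS`, branch lengths) and function names are hypothetical. Each of the five rate classes (an invariant class with rate 0 plus four gamma-like classes) starts at the root with its own base frequencies, and the site likelihood is the weighted sum over classes.

```python
import math

# Hypothetical illustration values (not from the paper).
# Class 0 is the invariant class (rate 0); classes 1-4 stand in for the four
# discrete gamma categories, chosen here so their mean rate is 1.
RATES = [0.0, 0.1, 0.5, 1.0, 2.4]
WEIGHTS = [0.2, 0.2, 0.2, 0.2, 0.2]  # p_inv = 0.2, remaining 0.8 split evenly
# One root frequency vector (A, C, G, T) per rate class; each sums to 1.
ROOT_FREQS = [
    [0.40, 0.30, 0.10, 0.20],
    [0.35, 0.30, 0.10, 0.25],
    [0.30, 0.25, 0.15, 0.30],
    [0.25, 0.25, 0.25, 0.25],
    [0.20, 0.40, 0.10, 0.30],
]


def jc69_prob(i, j, distance):
    """JC69 probability of state i changing to state j over `distance`
    expected substitutions per site (an assumed stand-in process)."""
    decay = math.exp(-4.0 * distance / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def site_likelihood(a, b, t1, t2, rates, weights, root_freqs):
    """Likelihood of observing states a and b at two tips that join at the
    root, with branch lengths t1 and t2, under a mixture of rate classes
    that each have their own root base frequencies."""
    total = 0.0
    for rate, weight, freqs in zip(rates, weights, root_freqs):
        # Sum over the unobserved root state x, weighted by this class's
        # own root frequencies rather than one shared frequency vector.
        class_lik = sum(
            freqs[x] * jc69_prob(x, a, rate * t1) * jc69_prob(x, b, rate * t2)
            for x in range(4)
        )
        total += weight * class_lik
    return total


# Example: tips show A (index 0) and G (index 2).
print(site_likelihood(0, 2, 0.1, 0.2, RATES, WEIGHTS, ROOT_FREQS))
```

Setting every row of `ROOT_FREQS` to the same vector recovers the usual shared-frequency mixture in this toy setting, which is one way to see what the extra free parameters of a 5rf-style model add.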
