Pitfalls of the most commonly used models of context dependent substitution.

Authors

Lindsay Helen, Yap Von Bing, Ying Hua, Huttley Gavin A

Affiliation

Computational Genomics Laboratory, John Curtin School of Medical Research, The Australian National University, Canberra, Australia.

Publication

Biol Direct. 2008 Dec 16;3:52. doi: 10.1186/1745-6150-3-52.

Abstract

BACKGROUND

Neighboring nucleotides exert a striking influence on mutation; the hypermutability of CpG dinucleotides in many genomes is a prime example. One approach to measuring the relative importance of sequence neighbors in molecular evolution uses continuous-time Markov process models of substitution that treat sequences as a series of independent tuples. The most widely used examples are the codon substitution models. We evaluated the suitability of derivatives of the nucleotide frequency weighted (hereafter NF) and tuple frequency weighted (hereafter TF) models for measuring sequence context-dependent substitution. The critical properties we address are their relationships to an independent nucleotide process and the robustness of parameter estimation to changes in sequence composition. We then consider how applying these two forms to intron sequence alignments from primates affects inference about dinucleotide substitution processes.

RESULTS

We prove that the NF form always nests the independent nucleotide process and that this is not true for the TF form. As a consequence, using TF to study context effects can be misleading, which is shown by both theoretical calculations and simulations. We describe a simple example where a context parameter estimated under TF is confounded with composition terms unless all sequence states are equi-frequent. We illustrate this for the dinucleotide case by simulation under a nucleotide model, showing that the TF form identifies a CpG effect when none exists. Our analysis of primate introns revealed that the effect of nucleotide neighbors is over-estimated under TF compared with NF. Parameter estimates for a number of contexts are also strikingly discordant between the two model forms.

CONCLUSION

Our results establish that the NF form should be used for analysis of independent-tuple, context-dependent processes. Although neighboring effects in general are still important, prominent influences previously identified using the TF form, such as the elevated CpG transversion rate, are artifacts. Our results further suggest that as few as 5 parameters may account for approximately 85% of neighboring nucleotide influence.
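The nesting claim in the Results can be illustrated with a small numerical sketch. It is not the authors' code and rests on assumptions the abstract does not spell out: that under NF the rate of a single-nucleotide change within a dinucleotide is weighted by the frequency of the new nucleotide, that under TF it is weighted by the frequency of the new dinucleotide, and that dinucleotide frequencies are products of nucleotide frequencies. The HKY-style exchangeability, the `kappa` value, and all function names are hypothetical choices made for illustration only.

```python
from itertools import product

NUCS = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def exchangeability(a, b, kappa=2.0):
    """HKY-style exchangeability (an assumed, illustrative choice)."""
    return kappa if (a, b) in TRANSITIONS else 1.0


def dinucleotide_rates(pi, form, kappa=2.0, context=1.0):
    """Off-diagonal rates between dinucleotides that differ at one position.

    form = "indep": two independently evolving nucleotide positions.
    form = "NF":    weight by the frequency of the new nucleotide (assumed).
    form = "TF":    weight by the frequency of the new dinucleotide (assumed),
                    taken here as the product of its nucleotide frequencies.
    context = 1.0 means "no context effect".
    """
    rates = {}
    for src in product(NUCS, repeat=2):
        for dst in product(NUCS, repeat=2):
            changed = [i for i in (0, 1) if src[i] != dst[i]]
            if len(changed) != 1:
                continue  # only single-nucleotide changes are instantaneous
            i = changed[0]
            r = exchangeability(src[i], dst[i], kappa)
            if form == "indep":
                rate = r * pi[dst[i]]
            elif form == "NF":
                rate = context * r * pi[dst[i]]
            elif form == "TF":
                rate = context * r * pi[dst[0]] * pi[dst[1]]
            else:
                raise ValueError(f"unknown form: {form}")
            rates[src, dst] = rate
    return rates


def ratio_spread(pi, form):
    """Max/min ratio of model rates to independent-process rates.

    A spread of 1 means the model is the independent process up to a
    constant rescaling of time; a larger spread means it is not.
    """
    indep = dinucleotide_rates(pi, "indep")
    model = dinucleotide_rates(pi, form)
    ratios = [model[key] / indep[key] for key in indep]
    return max(ratios) / min(ratios)


skewed = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
uniform = {n: 0.25 for n in NUCS}

print("NF, skewed composition: ", ratio_spread(skewed, "NF"))
print("TF, skewed composition: ", ratio_spread(skewed, "TF"))
print("TF, uniform composition:", ratio_spread(uniform, "TF"))
```

Under these assumptions, NF with all context terms set to 1 reproduces the independent process exactly. TF does so only when the nucleotides are equi-frequent; otherwise each rate carries an extra factor equal to the frequency of the unchanged neighbor, which a fitted context parameter can absorb and so mimic a context effect.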

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bc94/2628887/6a6ce7c0dea8/1745-6150-3-52-1.jpg
