Efficient context-dependent model building based on clustering posterior distributions for non-coding sequences.

Author information

Baele Guy, Van de Peer Yves, Vansteelandt Stijn

Affiliation

Department of Applied Mathematics and Computer Science, Ghent University, Ghent, Belgium.

Publication information

BMC Evol Biol. 2009 Apr 30;9:87. doi: 10.1186/1471-2148-9-87.

DOI:10.1186/1471-2148-9-87

PMID:19405957

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2695821/

Abstract

BACKGROUND

Many recent studies that relax the assumption of independent evolution of sites have done so at the expense of a drastic increase in the number of substitution parameters. While additional parameters cannot be avoided when modeling context-dependent evolution, a large increase in model dimensionality is justified only when accompanied by careful model-building strategies that guard against overfitting. Increased dimensionality raises the numerical cost of evaluating the models, lengthens convergence times in Bayesian Markov chain Monte Carlo algorithms, and makes Bayes Factor calculations even more tedious.

RESULTS

We have developed two model-search algorithms that reduce the number of Bayes Factor calculations by clustering posterior densities to decide on the equality of substitution behavior in different contexts. The selected model's fit is evaluated using a Bayes Factor, which we calculate via model-switch thermodynamic integration. To reduce computation time and to increase the precision of this integration, we propose splitting the calculations across different computers and calibrating the individual runs appropriately. Using the proposed strategies, we find, in a dataset of primate Ancestral Repeats, that careful modeling of context-dependent evolution may increase model fit considerably and that combining a context-dependent model with the assumption of varying rates across sites offers even larger improvements in model fit. Using a smaller nuclear SSU rRNA dataset, we show that context-dependence may only become detectable upon applying model-building strategies.
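The idea of splitting a model-switch thermodynamic integration into independently computed pieces can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy pair of Gaussian models (observation variances `V0` and `V1`, with a standard normal prior on a shared mean), and it replaces the MCMC average at each point of the path with a closed-form expectation so that the result can be checked against the exact log Bayes Factor. All function names, the chunk boundaries in `edges`, and the data are hypothetical. The log Bayes Factor is the integral over beta in [0, 1] of the expected log-likelihood difference under the bridging density; each sub-interval could in principle be handled on a separate machine and the partial integrals summed.

```python
import math

V0, V1 = 1.0, 4.0  # assumed observation variances under toy models M0 and M1


def expected_log_lik_diff(beta, xs):
    """E_beta[ln L1 - ln L0] under the density proportional to
    L1^beta * L0^(1 - beta) * prior, with a N(0, 1) prior on the mean.
    Computed in closed form here; a real analysis would average MCMC samples."""
    n, sx = len(xs), sum(xs)
    w = beta / V1 + (1.0 - beta) / V0       # effective data precision on the path
    post_var = 1.0 / (1.0 + n * w)          # variance of the bridging density
    post_mean = w * sx * post_var           # mean of the bridging density
    exp_ss = sum((x - post_mean) ** 2 for x in xs) + n * post_var
    return -0.5 * n * math.log(V1 / V0) - 0.5 * exp_ss * (1.0 / V1 - 1.0 / V0)


def integrate_chunk(lo, hi, xs, steps=200):
    """Trapezoid rule over one sub-interval of the path (one chunk per machine)."""
    h = (hi - lo) / steps
    vals = [expected_log_lik_diff(lo + i * h, xs) for i in range(steps + 1)]
    return h * (0.5 * vals[0] + sum(vals[1:-1]) + 0.5 * vals[-1])


def log_bayes_factor(xs, edges=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Sum the partial integrals; the chunks are independent of one another."""
    return sum(integrate_chunk(a, b, xs) for a, b in zip(edges, edges[1:]))


def exact_log_marginal(xs, v):
    """Exact log marginal likelihood of the toy model, used only as a check."""
    n, sx = len(xs), sum(xs)
    sxx = sum(x * x for x in xs)
    log_det = (n - 1) * math.log(v) + math.log(v + n)
    quad = (sxx - sx * sx / (v + n)) / v
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


if __name__ == "__main__":
    data = [0.3, -1.2, 2.5, 0.8, -0.4]  # made-up observations
    print("thermodynamic integration:", log_bayes_factor(data))
    print("exact:", exact_log_marginal(data, V1) - exact_log_marginal(data, V0))
```

Because each chunk depends only on its own range of beta, the partial integrals can be computed in any order and on any machine. In an MCMC setting, the number of samples per chunk would be the natural place to tune the precision of each run.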

CONCLUSION

While context-dependent evolutionary models can increase the model fit over traditional independent evolutionary models, such complex models will often contain too many parameters. Justification for the added parameters is thus required so that only those parameters that model evolutionary processes previously unaccounted for are added to the evolutionary model. To obtain an optimal balance between the number of parameters in a context-dependent model and the performance in terms of model fit, we have designed two parameter-reduction strategies and we have shown that model fit can be greatly improved by reducing the number of parameters in a context-dependent evolutionary model.
