Coding sequence density estimation via topological pressure.

Authors

Koslicki David, Thompson Daniel J

Affiliation

Department of Mathematics, Oregon State University, 368 Kidder Hall, Corvallis, OR 97330, USA.

Publication information

J Math Biol. 2015 Jan;70(1-2):45-69. doi: 10.1007/s00285-014-0754-2. Epub 2014 Jan 22.

DOI: 10.1007/s00285-014-0754-2
PMID: 24448658

Abstract

We give a new approach to coding sequence (CDS) density estimation in genomic analysis based on the topological pressure, which we develop from a well-known concept in ergodic theory. Topological pressure measures the 'weighted information content' of a finite word, and incorporates 64 parameters which can be interpreted as a choice of weight for each nucleotide triplet. We train the parameters so that the topological pressure fits the observed coding sequence density on the human genome, and use this to give ab initio predictions of CDS density over windows of size around 66,000 bp on the genomes of Mus musculus, rhesus macaque and Drosophila melanogaster. While the differences between these genomes are too great to expect that training on the human genome could predict, for example, the exact locations of genes, we demonstrate that our method gives reasonable estimates for the 'coarse scale' problem of predicting CDS density. Inspired again by ergodic theory, the weightings of the nucleotide triplets obtained from our training procedure are used to define a probability distribution on finite sequences, which can be used to distinguish between intron and exon sequences from the human genome of lengths between 750 and 5,000 bp. At the end of the paper, we explain the theoretical underpinning for our approach, which is the theory of Thermodynamic Formalism from the dynamical systems literature. Mathematica and MATLAB implementations of our method are available at http://sourceforge.net/projects/topologicalpres/ .
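
The abstract does not state the formula for topological pressure, so the following Python sketch is only an illustration of what a 'weighted information content' with one weight per nucleotide triplet could look like. It assumes a partition-function form in the style of thermodynamic formalism: sum, over the distinct length-n subwords of the window, the product of the weights of the triplets each subword contains, then take a base-4 logarithm and divide by n. The function name `topological_pressure`, the parameter `n`, and the uniform weight table are hypothetical and are not taken from the authors' Mathematica or MATLAB code.

```python
import math
from itertools import product


def topological_pressure(word, weights, n):
    """Assumed form of the pressure of a finite DNA word (not the paper's exact definition).

    word    : string over the alphabet A, C, G, T
    weights : dict mapping each of the 64 nucleotide triplets to a positive weight
    n       : subword length (must be at least 3 so each subword contains a triplet)
    """
    if n < 3 or len(word) < n:
        raise ValueError("need n >= 3 and a word at least n letters long")

    # Each distinct length-n subword is counted once, however often it occurs.
    subwords = {word[i:i + n] for i in range(len(word) - n + 1)}

    # Weight of a subword: product of the weights of its overlapping triplets.
    total = 0.0
    for u in subwords:
        total += math.prod(weights[u[j:j + 3]] for j in range(n - 2))

    return math.log(total, 4) / n


# Hypothetical starting point: all 64 triplet weights equal to 1.
uniform = {"".join(t): 1.0 for t in product("ACGT", repeat=3)}

print(topological_pressure("ACGTACGTAC", uniform, 3))
```

With all weights equal to 1 the sum simply counts distinct subwords, so the quantity reduces to an unweighted, entropy-like complexity measure; raising the weight of a triplet raises the pressure of windows that contain it. In the abstract's setting the 64 weights would be fitted so that this value tracks observed CDS density on the human genome. If the window length is tied to the subword length by 4^n + n - 1, as in related work on topological entropy of DNA sequences, then n = 8 gives 65,543 bp, which is consistent with the quoted window size of around 66,000 bp; that link is an assumption here, not something the abstract states.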

