Field-theoretic density estimation for biological sequence space with applications to 5' splice site diversity and aneuploidy in cancer.

Affiliations

Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724.

Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724.

Publication information

Proc Natl Acad Sci U S A. 2021 Oct 5;118(40). doi: 10.1073/pnas.2025782118.

DOI: 10.1073/pnas.2025782118
PMID: 34599093
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8501885/
Abstract

Density estimation in sequence space is a fundamental problem in machine learning that is also of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy (i.e., calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates). Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data are plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data are sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5' splice sites found in the human genome and to understand patterns of chromosomal abnormalities across human cancers.
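
The interpolation described in the abstract can be illustrated on the smallest nontrivial case: two binary sites, so that sequence space is {00, 01, 10, 11}. The sketch below is a toy penalized-likelihood model written for this note and is not the authors' implementation. It assumes a quadratic penalty, weighted by a hypothetical hyperparameter `a`, on the single pairwise interaction among the log-probabilities, and it finds the maximum a posteriori estimate by solving the resulting stationarity condition with bisection. The function name `map_estimate` and the example counts are likewise assumptions made for illustration.

```python
import math

def map_estimate(freqs, n, a, iters=100):
    """Toy MAP estimate over the sequences 00, 01, 10, 11.

    Assumed objective (illustrative only), with p_i proportional to exp(-phi_i):
        (a / 2) * (phi00 - phi01 - phi10 + phi11)**2  -  n * sum_i f_i * log p_i
    Setting the gradient to zero gives p = f + t * (+1, -1, -1, +1), where t
    solves  t + (a / n) * log_odds_ratio(p) = 0.  Adding t in this pattern
    leaves both single-site marginals unchanged.
    """
    f00, f01, f10, f11 = freqs
    # t must keep all four probabilities positive
    lo, hi = -min(f00, f11), min(f01, f10)

    def g(t):
        log_or = (math.log(f00 + t) + math.log(f11 + t)
                  - math.log(f01 - t) - math.log(f10 - t))
        return t + (a / n) * log_or

    # g is strictly increasing on (lo, hi), so bisection finds its unique root
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    t = 0.5 * (lo + hi)
    return [f00 + t, f01 - t, f10 - t, f11 + t]

if __name__ == "__main__":
    counts = [40, 10, 10, 40]              # made-up counts for 00, 01, 10, 11
    n = sum(counts)
    freqs = [c / n for c in counts]
    for a in (0.0, 1.0, 10.0, 100.0, 1e9):
        p = map_estimate(freqs, n, a)
        print(a, [round(x, 4) for x in p])
```

In this toy model, `a = 0` returns the observed sample frequencies, while a very large `a` drives the pairwise interaction to zero and yields the independent-site distribution with the same single-site marginals, which is the maximum entropy estimate under first-order constraints. Intermediate values of `a` trace a one-parameter path between the two.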
