Field-theoretic density estimation for biological sequence space with applications to 5' splice site diversity and aneuploidy in cancer.

Affiliations

Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724.

Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724.

Publication information

Proc Natl Acad Sci U S A. 2021 Oct 5;118(40). doi: 10.1073/pnas.2025782118.

Abstract

Density estimation in sequence space is a fundamental problem in machine learning that is also of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy (i.e., calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates). Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data are plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data are sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5' splice sites found in the human genome and to understand patterns of chromosomal abnormalities across human cancers.
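
As a rough illustration of the two endpoints mentioned in the abstract, the sketch below computes, for a toy sample, (a) the observed sample frequencies and (b) a maximum entropy estimate constrained only by single-site frequencies, which is simply the product of the per-site marginals. This is not the authors' method or code: the toy sample, the alphabet, the choice of first-order constraints, and all function names are assumptions made for illustration, and the field-theoretic prior that interpolates between the two endpoints is not implemented here.

```python
from collections import Counter
from itertools import product


def empirical_frequencies(seqs):
    """Observed sample frequencies; sequences never observed are simply absent."""
    counts = Counter(seqs)
    total = len(seqs)
    return {seq: count / total for seq, count in counts.items()}


def site_marginals(seqs, alphabet):
    """Frequency of each character at each position of equal-length sequences."""
    length = len(seqs[0])
    total = len(seqs)
    return [
        {a: sum(s[i] == a for s in seqs) / total for a in alphabet}
        for i in range(length)
    ]


def maxent_first_order(seqs, alphabet):
    """Maximum entropy distribution matching only the single-site frequencies.

    With first-order constraints alone, the maximum entropy solution is the
    product of the per-site marginals (sites are treated as independent).
    """
    marginals = site_marginals(seqs, alphabet)
    dist = {}
    for chars in product(alphabet, repeat=len(marginals)):
        p = 1.0
        for i, a in enumerate(chars):
            p *= marginals[i][a]
        dist["".join(chars)] = p
    return dist


# Hypothetical toy sample of length-2 sequences (not data from the paper).
alphabet = "ACGU"
sample = ["GU", "GU", "GU", "GC", "AU"]

empirical = empirical_frequencies(sample)
maxent = maxent_first_order(sample, alphabet)

for seq in ("GU", "GC", "AU", "AC"):
    print(seq, empirical.get(seq, 0.0), round(maxent[seq], 4))
```

In this toy case the unobserved sequence `AC` has zero empirical frequency but a small positive probability under the maximum entropy estimate, which is the conservative behavior the abstract attributes to regions where data are sparse or absent.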
