Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724.
Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724.
Proc Natl Acad Sci U S A. 2021 Oct 5;118(40). doi: 10.1073/pnas.2025782118.
Density estimation in sequence space is a fundamental problem in machine learning that is also of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy (i.e., calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates). Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data are plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data are sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5' splice sites found in the human genome and to understand patterns of chromosomal abnormalities across human cancers.
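The two endpoints of the one-dimensional family described above can be illustrated with a toy sketch. It assumes, purely for illustration, that the maximum entropy constraints are the single-site frequencies; in that case the maximum entropy distribution is the product of the per-position marginals. The function names and the four-sequence sample are hypothetical, and the Bayesian field-theoretic interpolation between the endpoints is not implemented here.

```python
from collections import Counter
from itertools import product


def empirical_frequencies(seqs):
    """Observed sample frequencies: one endpoint of the family."""
    n = len(seqs)
    return {s: c / n for s, c in Counter(seqs).items()}


def site_marginals(seqs):
    """Per-position character frequencies (the first-order point estimates)."""
    n = len(seqs)
    length = len(seqs[0])
    return [
        {a: c / n for a, c in Counter(s[i] for s in seqs).items()}
        for i in range(length)
    ]


def maxent_first_order(seqs, alphabet):
    """Maximum entropy distribution matching the single-site marginals.

    Under first-order constraints only (an assumption of this sketch),
    this is the product of the per-position marginals.
    """
    marginals = site_marginals(seqs)
    dist = {}
    for chars in product(alphabet, repeat=len(marginals)):
        p = 1.0
        for i, a in enumerate(chars):
            p *= marginals[i].get(a, 0.0)
        dist["".join(chars)] = p
    return dist


# Hypothetical toy sample of length-2 sequences with perfectly correlated sites.
sample = ["AC", "AC", "GT", "GT"]
emp = empirical_frequencies(sample)
me = maxent_first_order(sample, "ACGT")

print(emp)       # mass only on the observed sequences AC and GT
print(me["AT"])  # the maximum entropy estimate also gives mass to unobserved AT
```

In this sample the two sites are perfectly correlated, so the empirical frequencies put all mass on AC and GT, while the first-order maximum entropy estimate spreads mass equally over AC, AT, GC, and GT. The estimates described in the abstract lie between these two extremes, with the hyperparameter governing how much higher-order correlation is retained.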