Center for Nonlinear Studies and Theoretical Biology and Biophysics Group, Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA.
Department of Molecular Biomedical Sciences, North Carolina State University, Raleigh, North Carolina 27607, USA.
J Chem Phys. 2019 Jul 14;151(2):024106. doi: 10.1063/1.5110503.
Single cells exhibit a significant amount of variability in transcript levels, which arises from slow, stochastic transitions between gene expression states. Elucidating the nature of these states and understanding how transition rates are affected by different regulatory mechanisms require state-of-the-art methods to infer underlying models of gene expression from single cell data. A Bayesian approach to statistical inference is the most suitable method for model selection and uncertainty quantification of kinetic parameters using small data sets. However, this approach is impractical because current algorithms are too slow to handle typical models of gene expression. To solve this problem, we first show that time-dependent mRNA distributions of discrete-state models of gene expression are dynamic Poisson mixtures, whose mixing kernels are characterized by a piecewise deterministic Markov process. We combine this analytical result with a kinetic Monte Carlo algorithm to create a hybrid numerical method that accelerates the calculation of time-dependent mRNA distributions by 1000-fold compared to current methods. We then integrate the hybrid algorithm into an existing Monte Carlo sampler to estimate the Bayesian posterior distribution of many different, competing models in a reasonable amount of time. We demonstrate that kinetic parameters can be reasonably constrained for modestly sampled data sets if the model is known a priori. If there are many competing models, Bayesian evidence can rigorously quantify the likelihood of a model relative to other models from the data. We demonstrate that Bayesian evidence selects the true model and outperforms approximate metrics typically used for model selection.
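The Poisson-mixture idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the simplest two-state (on/off) gene model, starting in the off state with no mRNA, and every function and parameter name (`simulate_lambda`, `mrna_distribution`, `k_on`, `k_off`, `k_syn`, `delta`) is hypothetical. Gene-state switching is sampled by kinetic Monte Carlo; between switches the Poisson rate follows the deterministic flow dλ/dt = k_syn·s − δ·λ; the mRNA distribution at time t is then estimated as the average of Poisson(λ) distributions over the sampled paths.

```python
import math
import random

def simulate_lambda(t_end, k_on, k_off, k_syn, delta, rng):
    """Sample one piecewise-deterministic path; return the Poisson rate at t_end.

    Assumption: the gene starts off (state 0) with lambda = 0 (no mRNA).
    """
    state, lam, t = 0, 0.0, 0.0
    while True:
        # Kinetic Monte Carlo step: exponential dwell time in the current state.
        dwell = rng.expovariate(k_on if state == 0 else k_off)
        dt = min(dwell, t_end - t)
        # Deterministic flow between switches: dlam/dt = k_syn*state - delta*lam.
        target = k_syn * state / delta
        lam = target + (lam - target) * math.exp(-delta * dt)
        if dwell >= t_end - t:
            return lam
        t += dwell
        state = 1 - state

def poisson_pmf(n, lam):
    """Poisson probability of n counts with mean lam (handles lam == 0)."""
    if lam <= 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))

def mrna_distribution(t_end, k_on, k_off, k_syn, delta, n_max, n_paths, seed=0):
    """Estimate P(n mRNA at t_end) for n = 0..n_max as a mixture of Poissons."""
    rng = random.Random(seed)
    pmf = [0.0] * (n_max + 1)
    for _ in range(n_paths):
        lam = simulate_lambda(t_end, k_on, k_off, k_syn, delta, rng)
        for n in range(n_max + 1):
            pmf[n] += poisson_pmf(n, lam) / n_paths
    return pmf

if __name__ == "__main__":
    pmf = mrna_distribution(t_end=50.0, k_on=0.5, k_off=0.5, k_syn=10.0,
                            delta=1.0, n_max=60, n_paths=2000, seed=1)
    mean = sum(n * p for n, p in enumerate(pmf))
    print(f"estimated mean mRNA count: {mean:.2f}")
```

Because each sampled path contributes a full Poisson distribution rather than a single integer count, far fewer paths are needed for a smooth estimate than in a direct stochastic simulation of mRNA numbers, which is the intuition behind the speed-up described above. For the parameters in the usage example, the long-time mean should lie close to k_syn·k_on / ((k_on + k_off)·δ) = 5.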