Fan Jason, Chan Skylar, Patro Rob
Center for Bioinformatics and Computational Biology, University of Maryland, College Park, USA.
Algorithms Mol Biol. 2022 Mar 25;17(1):6. doi: 10.1186/s13015-022-00214-y.
There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in the resulting estimates that can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpins the gene-expression-based analyses routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g., producing smooth versus sparse estimates), strategies for performing model selection in experimental data have been addressed informally at best.
We derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. Furthermore, we demonstrate theoretically and experimentally that perplexity can be computed for arbitrary transcript abundance estimation models.
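As a rough illustration of the idea, the sketch below computes a language-model-style perplexity, the exponential of the negative mean log-likelihood, over a set of held-out fragments. It is a minimal sketch under stated assumptions, not the formulation derived in the paper: the function name `perplexity`, the dictionary of abundances, and the per-fragment lists of `(transcript, conditional probability)` pairs are all hypothetical, and a fragment that receives zero likelihood is simply reported as infinite perplexity here, whereas the paper extends the metric to handle such corner cases.

```python
import math


def perplexity(abundances, held_out_fragments):
    """Perplexity of held-out fragments under estimated abundances.

    abundances: dict mapping transcript id -> estimated abundance (hypothetical
        input; assumed to be normalized to sum to 1).
    held_out_fragments: list with one entry per held-out fragment; each entry is
        a list of (transcript id, P(fragment | transcript)) pairs (hypothetical
        representation of fragment-to-transcript compatibility).
    """
    if not held_out_fragments:
        raise ValueError("need at least one held-out fragment")

    total_log_lik = 0.0
    for compatible in held_out_fragments:
        # Marginal likelihood of the fragment: sum over compatible transcripts.
        lik = sum(abundances.get(t, 0.0) * p for t, p in compatible)
        if lik <= 0.0:
            # Assumption of this sketch only: an "impossible" fragment makes
            # the perplexity infinite. The paper treats this case more carefully.
            return math.inf
        total_log_lik += math.log(lik)

    # Exponential of the negative mean log-likelihood.
    return math.exp(-total_log_lik / len(held_out_fragments))


# Two transcripts with equal abundance, two uniquely mapping held-out fragments.
even = perplexity({"t1": 0.5, "t2": 0.5}, [[("t1", 1.0)], [("t2", 1.0)]])

# A sparse estimate that assigns zero abundance to a transcript seen in held-out data.
sparse = perplexity({"t1": 1.0, "t2": 0.0}, [[("t1", 1.0)], [("t2", 1.0)]])
```

Lower values indicate that the estimated abundances explain the held-out fragments better. In this toy input the sparse estimate is penalized because it rules out a fragment that actually appears in the held-out set.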
Alongside the derivation and implementation of perplexity for transcript abundance estimation, our study is the first to make possible model selection for transcript abundance estimation on experimental data in the absence of ground truth.