Perplexity: evaluating transcript abundance estimation in the absence of ground truth.

Authors

Fan Jason, Chan Skylar, Patro Rob

Affiliation

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, USA.

Publication information

Algorithms Mol Biol. 2022 Mar 25;17(1):6. doi: 10.1186/s13015-022-00214-y.

DOI:10.1186/s13015-022-00214-y

PMID:35331283

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8951746/

Abstract

BACKGROUND

There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in resulting estimates that can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpins the gene-expression-based analyses routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g., producing smooth versus sparse estimates), strategies for performing model selection in experimental data have been addressed informally at best.

RESULTS

We derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. Furthermore, we demonstrate theoretically and experimentally that perplexity can be computed for arbitrary transcript abundance estimation models.
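
As a rough illustration of the idea, the sketch below computes perplexity in the form familiar from language and topic models: the exponentiated negative mean log-likelihood of held-out fragments under a set of abundance estimates. It is a minimal sketch under stated assumptions, not the authors' implementation. The function name `perplexity`, the input layout (one dictionary of per-transcript conditional probabilities per fragment), and the `eps` floor for fragments that receive zero probability are all hypothetical; the abstract does not describe how the paper itself handles such corner cases, so the floor is only a stand-in.

```python
import math


def perplexity(held_out, abundances, eps=1e-12):
    """Perplexity of held-out fragments under estimated abundances.

    held_out   -- list of dicts, one per fragment, mapping a transcript id
                  to an assumed conditional probability P(fragment | transcript)
    abundances -- dict mapping transcript id to estimated relative abundance
    eps        -- hypothetical floor for fragments with zero probability
                  (a placeholder, not the paper's corner-case treatment)
    """
    total_log_prob = 0.0
    for fragment in held_out:
        # Marginal probability of the fragment: sum over compatible transcripts.
        p = sum(abundances.get(t, 0.0) * cond for t, cond in fragment.items())
        total_log_prob += math.log(max(p, eps))
    # Exponentiated negative mean log-likelihood; lower is better.
    return math.exp(-total_log_prob / len(held_out))


# Toy held-out set: one fragment compatible with each of two transcripts.
held_out = [{"t1": 1.0}, {"t2": 1.0}]

# Two illustrative candidate estimates, e.g. from different hyperparameters.
candidates = {
    "sparse": {"t1": 1.0, "t2": 0.0},
    "smooth": {"t1": 0.5, "t2": 0.5},
}

scores = {name: perplexity(held_out, est) for name, est in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

In this toy case the sparse estimate assigns zero probability to the second held-out fragment, so its perplexity is far higher than that of the smooth estimate, and the smooth estimate is selected. Choosing among candidate estimates by lowest held-out perplexity is the kind of model selection the abstract refers to, although the numbers here are illustrative only.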

CONCLUSIONS

Alongside the derivation and implementation of perplexity for transcript abundance estimation, our study is the first to make model selection for transcript abundance estimation possible on experimental data in the absence of ground truth.
