Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA.
Sci Rep. 2020 Feb 17;10(1):2734. doi: 10.1038/s41598-020-59516-z.
RNA-sequencing data is widely used to identify disease biomarkers and therapeutic targets using numerical methods such as clustering, classification, regression, and differential expression analysis. Such approaches rely on the assumption that mRNA abundance estimates from RNA-seq are reliable estimates of true expression levels. Here, using data from five RNA-seq processing pipelines applied to 6,690 human tumor and normal tissues, we show that nearly 88% of protein-coding genes have similar gene expression profiles across all pipelines. However, for >12% of protein-coding genes, current best-in-class RNA-seq processing pipelines differ in their abundance estimates by more than four-fold when applied to exactly the same samples and the same set of RNA-seq reads. Expression fold changes are similarly affected. Many of the impacted genes are widely studied disease-associated genes. We show that impacted genes exhibit diverse patterns of discordance among pipelines, suggesting that many inter-pipeline differences contribute to overall uncertainty in mRNA abundance estimates. A concerted, community-wide effort will be needed to develop gold standards for estimating the mRNA abundance of the discordant genes reported here. In the meantime, our list of discordantly evaluated genes provides an important resource for robust marker discovery and target selection.
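The four-fold criterion mentioned above can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify how the fold difference was computed, so the pseudocount, the max/min ratio across pipelines, the function names, and the toy gene and pipeline labels below are all assumptions made for illustration.

```python
def max_fold_difference(estimates, pseudocount=1.0):
    """Largest ratio between any two pipelines' estimates for one gene.

    The pseudocount (an assumption, not taken from the paper) avoids
    division by zero when a pipeline reports zero abundance.
    """
    values = [v + pseudocount for v in estimates]
    return max(values) / min(values)


def flag_discordant(abundance, threshold=4.0, pseudocount=1.0):
    """Return {gene: fold} for genes whose estimates differ by more than
    `threshold`-fold between at least two pipelines.

    abundance: {gene: {pipeline: estimate}} for a single sample
    (hypothetical layout chosen for this sketch).
    """
    flagged = {}
    for gene, by_pipeline in abundance.items():
        fold = max_fold_difference(by_pipeline.values(), pseudocount)
        if fold > threshold:
            flagged[gene] = fold
    return flagged


# Toy data with made-up gene and pipeline names, not real measurements.
toy = {
    "GENE_A": {"pipeline_1": 10.0, "pipeline_2": 12.0, "pipeline_3": 11.0},
    "GENE_B": {"pipeline_1": 1.0, "pipeline_2": 40.0, "pipeline_3": 9.0},
}
discordant = flag_discordant(toy)
```

In this toy input only `GENE_B` exceeds the four-fold threshold. A real analysis would have to aggregate over many samples and choose a normalization, neither of which the abstract describes.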