Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, USA.
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
Sci Rep. 2020 Oct 21;10(1):17925. doi: 10.1038/s41598-020-74567-y.
Choosing proper analysis methods for biomarker identification remains a critical challenge for most users who apply next-generation sequencing technologies such as RNA-seq to medical and health applications. The US Food and Drug Administration (FDA) has led the Sequencing Quality Control (SEQC) project to conduct a comprehensive investigation of 278 representative RNA-seq data analysis pipelines consisting of 13 sequence mapping, three quantification, and seven normalization methods. In this article, we focused on the joint effects of RNA-seq pipeline components on gene expression estimation and on the downstream prediction of disease outcomes. First, we developed and applied three metrics (i.e., accuracy, precision, and reliability) to quantitatively evaluate each pipeline's performance on gene expression estimation. We then investigated the correlation between the proposed metrics and the downstream prediction performance using two real-world cancer datasets (i.e., the SEQC neuroblastoma dataset and the NIH/NCI TCGA lung adenocarcinoma dataset). We found that RNA-seq pipeline components jointly and significantly impacted the accuracy of gene expression estimation, and that this impact extended to the downstream prediction of these cancer outcomes. Specifically, RNA-seq pipelines that produced more accurate, precise, and reliable gene expression estimates tended to perform better in the prediction of disease outcome. Finally, we provided scenarios as guidelines for using these three metrics to select sensible RNA-seq pipelines that improve the accuracy, precision, and reliability of gene expression estimation, which in turn improves downstream gene expression-based prediction of disease outcome.