Li Sheng, Łabaj Paweł P, Zumbo Paul, Sykacek Peter, Shi Wei, Shi Leming, Phan John, Wu Po-Yen, Wang May, Wang Charles, Thierry-Mieg Danielle, Thierry-Mieg Jean, Kreil David P, Mason Christopher E
[1] Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, USA. [2] The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York, USA. [3].
[1] Chair of Bioinformatics Research Group, Boku University Vienna, Vienna, Austria. [2].
Nat Biotechnol. 2014 Sep;32(9):888-95. doi: 10.1038/nbt.3000. Epub 2014 Aug 24.
High-throughput RNA sequencing (RNA-seq) enables comprehensive scans of entire transcriptomes, but best practices for analyzing RNA-seq data have not been fully defined, particularly for data collected with multiple sequencing platforms or at multiple sites. Here we used standardized RNA samples with built-in controls to examine sources of error in large-scale RNA-seq studies and their impact on the detection of differentially expressed genes (DEGs). Analysis of variations in guanine-cytosine content, gene coverage, sequencing error rate and insert size allowed identification of decreased reproducibility across sites. Moreover, commonly used methods for normalization (cqn, EDASeq, RUV2, sva, PEER) varied in their ability to remove these systematic biases, depending on sample complexity and initial data quality. Normalization methods that combine data from genes across sites are strongly recommended to identify and remove site-specific effects and can substantially improve RNA-seq studies.
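As a rough illustration of what removing a site-specific effect can look like, the sketch below centers each gene's log-scale expression within each site and then restores the gene's overall mean. It is not one of the methods named in the abstract (cqn, EDASeq, RUV2, sva, PEER). The function name `center_by_site` and the input layout (one row per gene, one column per sample) are hypothetical, and the approach assumes every site profiled the same set of samples, since otherwise real biological differences would be subtracted along with the site offset.

```python
from collections import defaultdict


def center_by_site(log_expr, sites):
    """Remove a per-gene additive site offset from log-scale expression.

    log_expr: list of rows, one per gene, each a list of log-scale values
              (one value per sample).
    sites:    list of site labels, one per sample, in column order.

    Assumption (illustrative only): each site measured the same samples,
    so any difference between per-site means is a technical site effect.
    """
    adjusted = []
    for row in log_expr:
        # Overall mean of this gene across all samples and sites.
        grand_mean = sum(row) / len(row)

        # Collect this gene's values by site.
        by_site = defaultdict(list)
        for value, site in zip(row, sites):
            by_site[site].append(value)
        site_mean = {s: sum(v) / len(v) for s, v in by_site.items()}

        # Subtract the site mean, then add back the grand mean so the
        # gene keeps its overall expression level.
        adjusted.append(
            [value - site_mean[site] + grand_mean
             for value, site in zip(row, sites)]
        )
    return adjusted
```

After this adjustment every site has the same per-gene mean, so a purely additive site offset no longer contributes to between-site differences. Biases that depend on guanine-cytosine content, gene coverage or insert size are not additive per gene and would need the dedicated methods listed above.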