Danielsson Frida, James Tojo, Gomez-Cabrero David, Huss Mikael
Brief Bioinform. 2015 Nov;16(6):941-9. doi: 10.1093/bib/bbv017. Epub 2015 Mar 30.
Sequencing-based gene expression methods like RNA-sequencing (RNA-seq) have become increasingly common, but it is often claimed that results obtained in different studies are not comparable owing to the influence of laboratory batch effects, differences in RNA extraction and sequencing library preparation methods and bioinformatics processing pipelines. It would be unfortunate if different experiments were in fact incomparable, as there is great promise in data fusion and meta-analysis applied to sequencing data sets. We therefore compared reported gene expression measurements for ostensibly similar samples (specifically, human brain, heart and kidney samples) in several different RNA-seq studies to assess their overall consistency and to examine the factors contributing most to systematic differences. The same comparisons were also performed after preprocessing all data in a consistent way, eliminating potential bias from bioinformatics pipelines. We conclude that published human tissue RNA-seq expression measurements appear relatively consistent in the sense that samples cluster by tissue rather than laboratory of origin given simple preprocessing transformations. The article is supplemented by a detailed walkthrough with embedded R code and figures.
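The consistency check described in the abstract (do samples group by tissue or by laboratory of origin after a simple transformation?) can be illustrated with a minimal sketch. This is not the authors' pipeline: their walkthrough is written in R, whereas this sketch uses Python. The expression values are synthetic toy numbers, not data from any study, and the gene count, lab names, the log2 pseudocount of 1 and the nearest-neighbour rule are all assumptions chosen for illustration.

```python
import math

# Synthetic FPKM-like values for six hypothetical genes (NOT data from the
# study). Each tissue is "measured" by two hypothetical labs; lab_B values
# are roughly doubled to mimic a systematic lab-specific scale difference.
samples = {
    ("brain", "lab_A"): [120, 5, 3, 80, 10, 40],
    ("heart", "lab_A"): [4, 200, 6, 70, 150, 8],
    ("kidney", "lab_A"): [6, 9, 180, 75, 12, 90],
    ("brain", "lab_B"): [250, 8, 7, 150, 25, 70],
    ("heart", "lab_B"): [10, 380, 10, 160, 310, 20],
    ("kidney", "lab_B"): [10, 20, 400, 140, 20, 170],
}


def log_transform(values, pseudocount=1.0):
    """Apply log2(x + pseudocount); the pseudocount of 1 is an assumption."""
    return [math.log2(v + pseudocount) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def nearest_neighbours(expression):
    """Map each sample to the other sample it correlates with most strongly."""
    logged = {key: log_transform(vals) for key, vals in expression.items()}
    nearest = {}
    for key, profile in logged.items():
        scored = [
            (pearson(profile, other_profile), other_key)
            for other_key, other_profile in logged.items()
            if other_key != key
        ]
        nearest[key] = max(scored)[1]
    return nearest


nearest = nearest_neighbours(samples)
for (tissue, lab), (nn_tissue, nn_lab) in sorted(nearest.items()):
    print(f"{tissue}/{lab} is closest to {nn_tissue}/{nn_lab}")
```

In this toy setup each sample's most correlated partner is the same tissue from the other lab, which is the pattern the abstract describes as "cluster by tissue rather than laboratory of origin". Real data would call for many more genes, proper normalization across studies and a full clustering or principal component analysis rather than a nearest-neighbour check.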