Department of Computer Science, Stanford University, Stanford, California, USA.
PLoS One. 2013 Jul 18;8(7):e68141. doi: 10.1371/journal.pone.0068141. Print 2013.
Transcriptomic assays that measure expression levels are widely used to study the manifestation of environmental or genetic variations in cellular processes. RNA-sequencing in particular has the potential to considerably improve such understanding because of its capacity to assay the entire transcriptome, including novel transcriptional events. However, as with earlier expression assays, analysis of RNA-sequencing data requires carefully accounting for factors that may introduce systematic, confounding variability in the expression measurements, resulting in spurious correlations. Here, we consider the problem of modeling and removing the effects of known and hidden confounding factors from RNA-sequencing data. We describe a unified residual framework that encapsulates existing approaches, and using this framework, present a novel method, HCP (Hidden Covariates with Prior). HCP uses a more informed assumption about the confounding factors, and performs as well as or better than existing approaches while having a much lower computational cost. Our experiments demonstrate that accounting for known and hidden factors with appropriate models improves the quality of RNA-sequencing data in two very different tasks: detecting genetic variations that are associated with nearby expression variations (cis-eQTLs), and constructing accurate co-expression networks.
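To make the residual idea concrete, the following is a minimal sketch of a generic two-step correction: regress the known covariates out of the expression matrix, then remove the leading principal components of the residuals as a stand-in for hidden factors. This is a simple PCA-style baseline of the kind such a framework could encompass, not the HCP method itself. The function name `remove_confounders`, the simulated data, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def remove_confounders(expr, known, n_hidden):
    """Return expression residuals after removing known and hidden factors.

    expr     : samples x genes matrix of expression values
    known    : samples x covariates matrix of known confounders
    n_hidden : number of leading principal components of the residuals
               to treat as hidden factors (an illustrative choice)
    """
    expr = np.asarray(expr, dtype=float)
    known = np.asarray(known, dtype=float)
    n_samples = expr.shape[0]

    # Step 1: least-squares fit of every gene on the known covariates
    # (plus an intercept), keeping only the residuals.
    design = np.column_stack([np.ones(n_samples), known])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    resid = expr - design @ coef

    # Step 2: approximate hidden factors by the top principal components
    # of the residuals and subtract their contribution.
    if n_hidden > 0:
        u, s, vt = np.linalg.svd(resid, full_matrices=False)
        resid = resid - (u[:, :n_hidden] * s[:n_hidden]) @ vt[:n_hidden]
    return resid


# Simulated data (hypothetical): one known batch covariate and one
# hidden factor, both affecting every gene, plus independent noise.
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 200
known = rng.normal(size=(n_samples, 1))
hidden = rng.normal(size=n_samples)
expr = (
    known @ rng.normal(size=(1, n_genes))
    + np.outer(hidden, rng.normal(size=n_genes))
    + 0.5 * rng.normal(size=(n_samples, n_genes))
)

cleaned = remove_confounders(expr, known, n_hidden=1)
```

In this sketch the number of hidden components is fixed by hand; in practice that choice, and the assumptions placed on the hidden factors, are where methods of this kind differ.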