Slaff Barry, Radens Caleb M, Jewell Paul, Jha Anupama, Lahens Nicholas F, Grant Gregory R, Thomas-Tikhonenko Andrei, Lynch Kristen W, Barash Yoseph
Department of Computer and Information Sciences, School of Engineering, University of Pennsylvania, Philadelphia, PA, USA.
Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
Nat Commun. 2021 Jun 7;12(1):3353. doi: 10.1038/s41467-021-23608-9.
The effects of confounding factors on gene expression analysis have been extensively studied following the introduction of high-throughput microarrays and subsequently RNA sequencing. In contrast, there is a lack of equivalent analysis and tools for RNA splicing. Here we first assess the effect of confounders on both expression and splicing quantifications in two large public RNA-Seq datasets (TARGET, ENCODE). We show that quantifications of splicing variations are affected at least as much as those of gene expression, revealing unwanted sources of variation in both datasets. Next, we develop MOCCASIN, a method to correct the effect of both known and unknown confounders on RNA splicing quantification, and demonstrate MOCCASIN's effectiveness on both synthetic and real data. Code, synthetic and corrected datasets are all made available as resources.