Thron Christopher, Jafari Farhad
Department of Science and Mathematics, Texas A&M University-Central Texas, Killeen, TX, 76549, USA.
Department of Radiology, University of Minnesota, Minneapolis, MN, 55455, USA.
BMC Bioinformatics. 2025 Jan 28;26(1):32. doi: 10.1186/s12859-025-06041-3.
RNA sequencing (RNA-seq) is the conventional genome-scale approach used to capture the expression levels of all detectable genes in a biological sample. It is now regularly used in population-based studies designed to identify genetic determinants of various diseases. Naturally, the accuracy of these tests should be verified and, if possible, improved. In this study, we aimed to detect and correct expression-level-dependent errors that conventional normalization techniques do not remove. We examined several RNA-seq datasets from the Cancer Genome Atlas (TCGA), Stand Up 2 Cancer (SU2C), and Genotype-Tissue Expression (GTEx) databases with various types of preprocessing. By applying local averaging, we found expression-level-dependent biases that differ from sample to sample in all datasets studied. Using simulations, we show that these biases corrupt gene-gene correlation estimates and t-tests between subpopulations. To mitigate these biases, we introduce two nonlinear transforms, based on statistical considerations, that correct them. We demonstrate that these transforms effectively remove the observed per-sample biases, reduce sample-to-sample variance, and improve the characteristics of gene-gene correlation distributions. Using a novel simulation methodology that creates controlled differences between subpopulations, we show that these transforms reduce variability and increase the sensitivity of two-population tests. After the data were corrected for bias, sensitivity and specificity improved by roughly 3-5% in most instances. Altogether, these results improve our capacity to understand gene-gene relationships and may lead to novel ways to use the information derived from clinical tests.
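The local-averaging step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the window size, the reference level, or the gene ordering, so everything below is an assumption made for illustration and is not the authors' implementation: genes are ordered by their mean log expression across samples, each sample's deviation from that mean is computed, and a moving average over the ordered genes exposes any per-sample trend that depends on expression level. The function name `local_average_bias`, the `window` parameter, and the synthetic data are all hypothetical.

```python
import numpy as np


def local_average_bias(log_expr, window=201):
    """Estimate per-sample, expression-level-dependent bias by local averaging.

    log_expr: array of shape (n_genes, n_samples), log-scale expression.
    Returns (sorted_ref, bias), where bias[i, j] is the moving average of
    sample j's deviation from the per-gene mean, taken over `window`
    consecutive genes ordered by that mean.
    NOTE: illustrative assumption only, not the method from the paper.
    """
    ref = log_expr.mean(axis=1)              # per-gene reference level
    order = np.argsort(ref)                  # genes from low to high expression
    dev = (log_expr - ref[:, None])[order]   # per-sample deviations, sorted by ref
    kernel = np.ones(window) / window
    # mode="valid" drops the edges, leaving n_genes - window + 1 rows
    bias = np.column_stack(
        [np.convolve(dev[:, j], kernel, mode="valid") for j in range(dev.shape[1])]
    )
    return ref[order], bias


# Synthetic data (hypothetical): 5000 genes, 20 samples, log scale.
rng = np.random.default_rng(0)
n_genes, n_samples = 5000, 20
true_level = rng.normal(5.0, 2.0, size=n_genes)
log_expr = true_level[:, None] + rng.normal(0.0, 0.5, size=(n_genes, n_samples))

# Inject a bias into sample 0 that grows with expression level. A single
# global scale factor would not remove a tilt of this kind.
log_expr[:, 0] += 0.3 * (true_level - 5.0)

sorted_ref, bias = local_average_bias(log_expr)
print("sample 0 bias at low / high expression:", bias[0, 0], bias[-1, 0])
print("largest |bias| among the other samples:", np.abs(bias[:, 1:]).max())
```

In this toy setup the distorted sample shows a clear trend from negative to positive bias across the expression range, while the remaining samples stay close to zero. A flat curve after correction would be one way to check that a transform has removed the trend.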