Institute for Frontier Life and Medical Sciences, Kyoto University, Kyoto, Japan.
Institute for Liberal Arts and Sciences, Kyoto University, Kyoto, Japan.
PLoS One. 2022 Jan 28;17(1):e0263344. doi: 10.1371/journal.pone.0263344. eCollection 2022.
Gene co-expression analysis is an attractive tool for leveraging the enormous number of public RNA-seq datasets to predict gene functions and regulatory mechanisms. However, the optimal data processing steps for accurately predicting gene co-expression from such large datasets remain unclear. In particular, the importance of batch effect correction is understudied.
Using 50 different workflows, we processed RNA-seq data from 68 human and 76 mouse cell types and tissues into 7,200 genome-wide gene co-expression networks. We then conducted a systematic analysis of the factors that result in high-quality co-expression predictions, focusing on normalization, batch effect correction, and the measure of correlation. We confirmed the key importance of high sample counts for high-quality predictions. However, choosing a suitable normalization approach and applying batch effect correction can further improve the quality of co-expression estimates, equivalent to a >80% and >40% increase in samples, respectively. In larger datasets, batch effect removal was equivalent to more than doubling the sample size. Finally, Pearson correlation appears more suitable than Spearman correlation, except for smaller datasets.
A key point for accurate prediction of gene co-expression is the collection of many samples. However, paying attention to data normalization, batch effects, and the measure of correlation can significantly improve the quality of co-expression estimates.
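The three processing choices discussed above (normalization, batch effect correction, and the measure of correlation) can be illustrated with a minimal sketch. It is not one of the 50 workflows evaluated in the study. It assumes a genes-by-samples count matrix, uses log2 counts-per-million as a stand-in normalization, and replaces dedicated batch correction methods with simple per-batch mean-centering. The function names `log_cpm`, `center_by_batch`, `rank_rows`, and `coexpression` are hypothetical and chosen for illustration only.

```python
import numpy as np

def log_cpm(counts, pseudocount=1.0):
    """Normalize a genes x samples count matrix to log2 counts-per-million."""
    lib_sizes = counts.sum(axis=0, keepdims=True)  # total counts per sample
    return np.log2(counts / lib_sizes * 1e6 + pseudocount)

def center_by_batch(expr, batches):
    """Crude batch adjustment (assumption: a stand-in, not a published method).

    Subtracts each gene's mean within every batch, which removes additive
    batch shifts but nothing more.
    """
    out = expr.astype(float)  # astype returns a copy, so expr is not modified
    for b in np.unique(batches):
        mask = batches == b
        out[:, mask] -= out[:, mask].mean(axis=1, keepdims=True)
    return out

def rank_rows(x):
    """Rank values within each row, giving tied values their average rank."""
    ranks = np.empty(x.shape, dtype=float)
    for i, row in enumerate(x):
        _, inverse, counts = np.unique(row, return_inverse=True, return_counts=True)
        last_rank = np.cumsum(counts)               # highest 1-based rank in each tie group
        avg_rank = last_rank - (counts - 1) / 2.0   # mean rank of each tie group
        ranks[i] = avg_rank[inverse]
    return ranks

def coexpression(expr, method="pearson"):
    """Gene x gene correlation matrix; Spearman is Pearson on per-gene ranks."""
    data = rank_rows(expr) if method == "spearman" else expr
    return np.corrcoef(data)  # rows (genes) are treated as variables

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = rng.poisson(50, size=(5, 12)).astype(float)  # toy data: 5 genes, 12 samples
    batches = np.array([0] * 6 + [1] * 6)                 # two hypothetical batches
    expr = center_by_batch(log_cpm(counts), batches)
    print(coexpression(expr, "pearson").round(2))
    print(coexpression(expr, "spearman").round(2))
```

Keeping each step in its own function makes it straightforward to swap one choice (for example, the correlation measure) while holding the others fixed, which is the kind of comparison the abstract describes.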