Altay Gökmen, Zapardiel-Gonzalo Jose, Peters Bjoern
La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA.
bioRxiv. 2023 Jan 3:2023.01.02.522518. doi: 10.1101/2023.01.02.522518.
Gene network inference (GNI) methods have the potential to reveal functional relationships between genes and their products. Most GNI algorithms were developed for microarray gene expression datasets, and their application to RNA-seq data is relatively recent. Because the characteristics of RNA-seq data differ from those of microarray data, it remains an open question which preprocessing methods should be applied to RNA-seq data prior to GNI to attain optimal performance, and what sample size is required to obtain reliable GNI estimates.
We ran 9144 analyses of 7 different RNA-seq datasets to evaluate 300 different preprocessing combinations that include data transformations, normalizations and association estimators. We found that there was no single best-performing preprocessing combination, but that there were several good ones. Performance varied widely across datasets, which emphasizes the importance of choosing an appropriate preprocessing configuration before GNI. Two preprocessing combinations appeared promising in general: first, log2 TPM (transcripts per million) with variance-stabilizing transformation (VST) and the Pearson correlation coefficient (PCC) as the association estimator; second, raw RNA-seq count data with PCC. Along with these two, we identified 18 other good preprocessing combinations. Any of these combinations might perform best on a given dataset. Therefore, the GNI performance of these approaches should be measured on any new dataset to select the best-performing one for it. In terms of the required biological sample size of RNA-seq data, we found that between 30 and 85 samples were required to generate reliable GNI estimates.
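As an illustration of what a preprocessing combination looks like in practice, the following is a minimal sketch of two simplified pipelines: raw counts with PCC, and log2 TPM with PCC. It is not the authors' implementation. The function names, the pseudocount of 1, the use of gene lengths in base pairs, and the use of absolute correlation as an edge score are all assumptions made for this example, and the VST step is omitted because it is normally provided by a dedicated package.

```python
import numpy as np


def counts_to_tpm(counts, gene_lengths_bp):
    """Convert a genes x samples count matrix to TPM.

    gene_lengths_bp holds one length (in base pairs) per gene (assumed input).
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1000.0
    # Reads per kilobase, then rescale every sample (column) to sum to 1e6.
    rate = counts / lengths_kb[:, None]
    return rate / rate.sum(axis=0, keepdims=True) * 1e6


def log2_tpm(counts, gene_lengths_bp, pseudocount=1.0):
    """log2(TPM + pseudocount); the pseudocount value is an assumption."""
    return np.log2(counts_to_tpm(counts, gene_lengths_bp) + pseudocount)


def pcc_network(expr):
    """Gene x gene matrix of absolute Pearson correlations, zero diagonal.

    expr is genes x samples. Genes with constant expression yield NaN
    scores and should be filtered out beforehand.
    """
    scores = np.abs(np.corrcoef(expr))
    np.fill_diagonal(scores, 0.0)
    return scores


def top_edges(scores, k):
    """Return the k highest-scoring gene pairs as (gene_i, gene_j, score)."""
    rows, cols = np.triu_indices_from(scores, k=1)
    values = scores[rows, cols]
    order = np.argsort(values)[::-1][:k]
    return [(int(rows[i]), int(cols[i]), float(values[i])) for i in order]


if __name__ == "__main__":
    # Toy data only: 3 genes x 4 samples, with made-up gene lengths.
    counts = np.array([[10, 20, 30, 40],
                       [5, 1, 7, 3],
                       [100, 220, 310, 390]])
    lengths = np.array([1000, 2000, 500])

    raw_scores = pcc_network(counts)                      # raw counts + PCC
    tpm_scores = pcc_network(log2_tpm(counts, lengths))   # log2 TPM + PCC
    print(top_edges(raw_scores, 1))
    print(top_edges(tpm_scores, 1))
```

Because the best combination differs between datasets, a sketch like this would be run once per candidate pipeline on a new dataset, with each resulting edge ranking scored against whatever reference network is available for that dataset.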
This study provides practical recommendations on default choices for data preprocessing prior to GNI analysis of RNA-seq data to obtain optimal performance results.