Deng Fei, Feng Catherine H, Gao Nan, Zhang Lanjing
Department of Chemical Biology, Ernest Mario School of Pharmacy, Rutgers University, Piscataway, NJ 08854, USA.
Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA.
Trans Artif Intell. 2025;1(1). doi: 10.53941/tai.2025.100005. Epub 2025 May 25.
Normalization is a critical step in quantitative analyses of biological processes. Recent work shows that cross-platform integration and normalization enable machine learning (ML) training on RNA microarray and RNA-seq data, but those studies did not use independent datasets. It is therefore unclear how to improve ML modelling performance on independent microarray-based and RNA-seq-based datasets. Inspired by the housekeeping genes commonly used in experimental biology, this study tests the hypothesis that non-differentially expressed genes (NDEG) may improve normalization of transcriptomic data and, in turn, the cross-platform performance of ML models. TCGA breast cancer microarray and RNA-seq datasets were used as independent training and test datasets, respectively, to classify the molecular subtypes of breast cancer. NDEG (p > 0.85) and differentially expressed genes (DEG) (p < 0.05) were selected based on ANOVA p values and used for subsequent data normalization and classification, respectively. Models trained on data from one platform were tested on the other platform. Our data show that NDEG and DEG gene selection could effectively improve model classification performance. Normalization methods based on parametric statistical analysis were inferior to those based on nonparametric statistics. In this study, the LOG_QN and LOG_QNZ normalization methods combined with the neural network classification model seemed to achieve better performance. NDEG-based normalization therefore appears useful for cross-platform testing on completely independent datasets. However, more studies are required to examine whether NDEG-based normalization can improve ML classification performance in other datasets and other omic data types.
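The gene-selection step described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes a one-way ANOVA across subtype labels computed per gene with `scipy.stats.f_oneway`, and the function name `select_genes_by_anova`, the toy expression matrix, and the subtype labels are all hypothetical. Only the two thresholds (p > 0.85 for NDEG, p < 0.05 for DEG) come from the abstract.

```python
import numpy as np
from scipy.stats import f_oneway


def select_genes_by_anova(expr, labels, ndeg_p=0.85, deg_p=0.05):
    """Split genes into NDEG and DEG candidates by one-way ANOVA p value.

    expr   : array of shape (n_samples, n_genes), expression values
    labels : array of shape (n_samples,), subtype label per sample
    Returns (ndeg_idx, deg_idx, pvals); genes with p between the two
    thresholds fall into neither set.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    pvals = np.empty(expr.shape[1])
    for j in range(expr.shape[1]):
        # One sample vector per subtype for gene j
        samples = [expr[labels == g, j] for g in groups]
        pvals[j] = f_oneway(*samples).pvalue
    ndeg_idx = np.flatnonzero(pvals > ndeg_p)  # stable across subtypes
    deg_idx = np.flatnonzero(pvals < deg_p)    # differ across subtypes
    return ndeg_idx, deg_idx, pvals


# Hypothetical toy data: 9 samples (3 per subtype) x 3 genes.
labels = np.array(["A"] * 3 + ["B"] * 3 + ["C"] * 3)
expr = np.array([
    [0.0, 1.0, 1.0],
    [0.1, 2.0, 2.0],
    [-0.1, 3.0, 3.0],
    [5.0, 1.0, 1.5],
    [5.1, 2.0, 2.5],
    [4.9, 3.0, 3.5],
    [10.0, 1.0, 2.0],
    [10.1, 2.0, 3.0],
    [9.9, 3.0, 4.0],
])

ndeg, deg, pvals = select_genes_by_anova(expr, labels)
print("NDEG gene indices:", ndeg.tolist())
print("DEG gene indices:", deg.tolist())
```

In this toy matrix, gene 0 differs strongly between groups, gene 1 has identical values in every group, and gene 2 shifts only slightly, so it is assigned to neither set. How the selected NDEG then enter the normalization (for example in LOG_QN or LOG_QNZ) is not specified in the abstract and is not sketched here.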