Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Wynnewood, PA, USA.
Commun Biol. 2023 Feb 25;6(1):222. doi: 10.1038/s42003-023-04588-6.
Large compendia of gene expression data have proven valuable for the discovery of novel biological relationships. Historically, most available RNA assays were run on microarray, while RNA-seq is now the platform of choice for many new experiments. The data structure and distributions between the platforms differ, making it challenging to combine them directly. Here we perform supervised and unsupervised machine learning evaluations to assess which existing normalization methods are best suited for combining microarray and RNA-seq data. We find that quantile and Training Distribution Matching normalization allow for supervised and unsupervised model training on microarray and RNA-seq data simultaneously. Nonparanormal normalization and z-scores are also appropriate for some applications, including pathway analysis with Pathway-Level Information Extractor (PLIER). We demonstrate that it is possible to perform effective cross-platform normalization using existing methods to combine microarray and RNA-seq data for machine learning applications.
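As an illustration of the quantile normalization idea mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes both platforms have already been reduced to the same set of genes, that matrices are laid out as genes by samples, and that the microarray data serve as the reference distribution. The function names `reference_distribution` and `quantile_normalize_to_reference` are hypothetical, and the toy values are made up for demonstration only.

```python
import numpy as np


def reference_distribution(reference_data: np.ndarray) -> np.ndarray:
    """Build a reference distribution from a genes x samples matrix.

    Each column (sample) is sorted independently, then the sorted values
    are averaged across samples, giving one value per rank.
    """
    return np.sort(reference_data, axis=0).mean(axis=1)


def quantile_normalize_to_reference(
    target_data: np.ndarray, reference: np.ndarray
) -> np.ndarray:
    """Map every sample in target_data onto the reference distribution.

    Assumption: target_data has the same number of genes (rows) as the
    reference has ranks. Ties are broken by position rather than averaged,
    which is a simplification.
    """
    if target_data.shape[0] != reference.shape[0]:
        raise ValueError("target and reference must cover the same genes")
    # Double argsort turns each column into 0-based ranks.
    ranks = target_data.argsort(axis=0).argsort(axis=0)
    # Replace each value by the reference value at the same rank.
    return reference[ranks]


# Toy data (hypothetical): 3 genes x 2 samples per platform.
microarray = np.array([[5.0, 4.0],
                       [2.0, 1.0],
                       [3.0, 6.0]])
rnaseq = np.array([[100.0, 0.0],
                   [10.0, 50.0],
                   [1.0, 7.0]])

ref = reference_distribution(microarray)
rnaseq_normalized = quantile_normalize_to_reference(rnaseq, ref)
```

After this step, every RNA-seq sample has exactly the same set of values as the microarray-derived reference, with only the gene ordering differing between samples, so the two platforms can be concatenated into one training matrix. Whether that is appropriate for a given downstream model is the kind of question the evaluations described in the abstract address.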