Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data.

Author information

Department of Molecular and Systems Biology.

Department of Environmental Health Sciences, Arnold School of Public Health, University of South Carolina, Columbia, SC, 29208, USA.

Publication information

Bioinformatics. 2018 Jun 1;34(11):1868-1874. doi: 10.1093/bioinformatics/bty026.

DOI:10.1093/bioinformatics/bty026

PMID:29360996

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5972664/

Abstract

MOTIVATION

Molecular subtypes of cancers and autoimmune disease, defined by transcriptomic profiling, have provided insight into disease pathogenesis, molecular heterogeneity and therapeutic responses. However, technical biases inherent to different gene expression profiling platforms present a unique problem when analyzing data generated from different studies. Currently, there is a lack of effective methods designed to eliminate platform-based bias. We present a method to normalize and classify RNA-seq data using machine learning classifiers trained on DNA microarray data and molecular subtypes in two datasets: breast invasive carcinoma (BRCA) and colorectal cancer (CRC).

RESULTS

Multiple analyses show that feature specific quantile normalization (FSQN) successfully removes platform-based bias from RNA-seq data, regardless of feature scaling or machine learning algorithm. We achieve up to 98% accuracy for BRCA data and 97% accuracy for CRC data in assigning molecular subtypes to RNA-seq data normalized using FSQN and a support vector machine trained exclusively on DNA microarray data. We find that maximum accuracy was achieved when normalizing RNA-seq datasets that contain at least 25 samples. FSQN allows comparison of RNA-seq data to existing DNA microarray datasets. Using these techniques, we can successfully leverage information from existing gene expression data in new analyses despite different platforms used for gene expression profiling.
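The core idea of FSQN, as described above, can be illustrated with a minimal sketch: each feature (gene) in the RNA-seq data is quantile-normalized separately, using that same feature's distribution in the microarray data as the target. The code below is not the authors' R implementation. The function names `quantile_normalize_to_target` and `fsqn`, the rows-as-samples layout, and the linear interpolation used when the two datasets have different sample counts are all assumptions made for illustration.

```python
def quantile_normalize_to_target(values, target):
    """Map `values` onto the empirical distribution of `target`.

    The rank order of `values` is preserved; each value is replaced by the
    target value at the same relative rank (linearly interpolated when the
    two lists differ in length). Ties are broken by original position.
    """
    target_sorted = sorted(target)
    n, m = len(values), len(target_sorted)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        q = rank / (n - 1) if n > 1 else 0.5   # relative rank in [0, 1]
        pos = q * (m - 1)                      # fractional index into target
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        out[idx] = target_sorted[lo] * (1 - frac) + target_sorted[hi] * frac
    return out


def fsqn(test_matrix, target_matrix):
    """Feature-by-feature quantile normalization (illustrative sketch).

    Both matrices are lists of rows (samples) with the same features in the
    same column order. Each column of `test_matrix` is normalized against
    the matching column of `target_matrix`.
    """
    n_features = len(test_matrix[0])
    columns = []
    for j in range(n_features):
        test_col = [row[j] for row in test_matrix]
        target_col = [row[j] for row in target_matrix]
        columns.append(quantile_normalize_to_target(test_col, target_col))
    # Transpose back to rows = samples.
    return [list(row) for row in zip(*columns)]


# Toy data (hypothetical values): 3 samples x 2 genes on each platform.
microarray = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
rnaseq = [[100.0, 5.0], [300.0, 1.0], [200.0, 3.0]]
normalized = fsqn(rnaseq, microarray)
```

After this step, every gene in `normalized` takes values from the microarray distribution of that gene while keeping the sample ordering observed in the RNA-seq data, so a classifier trained on the microarray data can be applied to it. Normalizing per feature, rather than per sample, is what distinguishes this from ordinary quantile normalization.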

AVAILABILITY AND IMPLEMENTATION

FSQN has been submitted as an R package to CRAN. All code used for this study is available on GitHub (https://github.com/jenniferfranks/FSQN).

CONTACT

michael.l.whitfield@dartmouth.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
