Normalization and Selecting Non-Differentially Expressed Genes Improve Machine Learning Modelling of Cross-Platform Transcriptomic Data.

Authors

Deng Fei, Feng Catherine H, Gao Nan, Zhang Lanjing

Affiliations

Department of Chemical Biology, Ernest Mario School of Pharmacy, Rutgers University, Piscataway, NJ 08854, USA.

Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA.

Publication information

Trans Artif Intell. 2025;1(1). doi: 10.53941/tai.2025.100005. Epub 2025 May 25.

DOI:10.53941/tai.2025.100005
PMID:40630982
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12235674/
Abstract

Normalization is a critical step in quantitative analyses of biological processes. Recent works show that cross-platform integration and normalization enable machine learning (ML) training on RNA microarray and RNA-seq data, but no independent datasets were used in their studies. Therefore, it is unclear how to improve ML modelling performance on independent RNA array and RNA-seq based datasets. Inspired by the house-keeping genes that are commonly used in experimental biology, this study tests the hypothesis that non-differentially expressed genes (NDEG) may improve normalization of transcriptomic data and subsequently cross-platform modelling performance of ML models. Microarray and RNA-seq datasets of the TCGA breast cancer were used as independent training and test datasets, respectively, to classify the molecular subtypes of breast cancer. NDEG (p > 0.85) and differentially expressed genes (DEG) (p < 0.05) were selected based on the p values of ANOVA analysis and used for subsequent data normalization and classification, respectively. Models trained based on data from one platform were used for testing on the other platform. Our data show that NDEG and DEG gene selection could effectively improve the model classification performance. Normalization methods based on parametric statistical analysis were inferior to those based on nonparametric statistics. In this study, the LOG_QN and LOG_QNZ normalization methods combined with the neural network classification model seem to achieve better performance. Therefore, NDEG-based normalization appears useful for cross-platform testing on completely independent datasets. However, more studies are required to examine whether NDEG-based normalization can improve ML classification performance in other datasets and other omic data types.
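
The gene selection and normalization steps named in the abstract can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the authors' method: the abstract does not define LOG_QN or LOG_QNZ, so the sketch reads them as a log2 transform followed by quantile normalization, with the trailing Z taken to mean a per-gene z-score. The function names (`select_genes`, `log_qn`, `log_qnz`) are hypothetical, the ANOVA p values are assumed to be computed beforehand, and how the NDEG set feeds into normalization is not stated in the abstract, so the functions simply normalize whatever gene columns they are given.

```python
import math
from statistics import mean, pstdev


def select_genes(p_values, ndeg_min=0.85, deg_max=0.05):
    """Split genes by precomputed ANOVA p value (thresholds from the abstract)."""
    ndeg = sorted(g for g, p in p_values.items() if p > ndeg_min)
    deg = sorted(g for g, p in p_values.items() if p < deg_max)
    return ndeg, deg


def log2_transform(matrix):
    """Apply log2(x + 1) to a samples-by-genes matrix (list of rows)."""
    return [[math.log2(x + 1.0) for x in row] for row in matrix]


def quantile_normalize(matrix):
    """Give every sample (row) the same distribution: the mean of the sorted rows."""
    n_genes = len(matrix[0])
    sorted_rows = [sorted(row) for row in matrix]
    reference = [mean(r[k] for r in sorted_rows) for k in range(n_genes)]
    out = []
    for row in matrix:
        # Rank the genes within this sample; ties are broken by column index.
        order = sorted(range(n_genes), key=lambda j: row[j])
        new_row = [0.0] * n_genes
        for rank, j in enumerate(order):
            new_row[j] = reference[rank]
        out.append(new_row)
    return out


def zscore_genes(matrix):
    """Z-score each gene (column) across samples; constant genes map to 0."""
    n_genes = len(matrix[0])
    stats = []
    for j in range(n_genes):
        col = [row[j] for row in matrix]
        stats.append((mean(col), pstdev(col)))
    return [
        [(row[j] - stats[j][0]) / stats[j][1] if stats[j][1] > 0 else 0.0
         for j in range(n_genes)]
        for row in matrix
    ]


def log_qn(matrix):
    """Assumed reading of LOG_QN: log2 transform, then quantile normalization."""
    return quantile_normalize(log2_transform(matrix))


def log_qnz(matrix):
    """Assumed reading of LOG_QNZ: LOG_QN followed by a per-gene z-score."""
    return zscore_genes(log_qn(matrix))
```

In this sketch each platform's matrix would be normalized separately, so that a model trained on one platform sees test data from the other platform on a comparable scale. Quantile normalization is rank-based, which is in line with the abstract's observation that nonparametric approaches performed better than parametric ones.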

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad69/12235674/2aaed42d6b3d/nihms-2087281-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad69/12235674/49099d8becb4/nihms-2087281-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad69/12235674/8d6bd2ab461c/nihms-2087281-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad69/12235674/5cb532914c50/nihms-2087281-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad69/12235674/1a99f654ed3c/nihms-2087281-f0004.jpg