Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity.

Affiliations

Department of Biochemistry and Molecular Biology II, University of Granada, 18071 Granada, Spain.

"José Mataix Verdú" Institute of Nutrition and Food Technology (INYTA), Center of Biomedical Research, University of Granada, 18100 Granada, Spain.

Publication information

Genes (Basel). 2023 Jan 18;14(2):248. doi: 10.3390/genes14020248.

DOI:10.3390/genes14020248

PMID:36833178

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9956296/

Abstract

The use of machine learning techniques to build predictive models of disease outcomes from omics and other types of molecular data has gained enormous relevance in the biomedical field in the last few years. Nonetheless, the value of omics studies and machine learning tools depends on the proper application of algorithms and on the appropriate preprocessing and management of the input omics and molecular data. Currently, many approaches that apply machine learning to omics data for predictive purposes make mistakes in one or more of the following key steps: experimental design, feature selection, data preprocessing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data, and we present a series of best practices and recommendations for each of these steps. In particular, we describe the main particularities of each omics data layer, the most suitable preprocessing approaches for each source, and a compilation of best practices and tips for predicting disease development with machine learning. Using examples of real data, we show how to address the key problems of multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, missing values, and class imbalance). Finally, we propose model improvements based on the results found, which serve as the basis for future work.
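As a rough illustration of two of the problems named in the abstract (missing values and class imbalance), the sketch below imputes and scales features using statistics learned from the training split only, and derives inverse-frequency class weights. It is not taken from the paper: the function names, the toy data, the choice of median imputation, and the weighting formula `n / (k * count)` are all assumptions made for illustration, and the sketch assumes every feature has at least one observed training value.

```python
from collections import Counter
from statistics import mean, median, pstdev


def fit_preprocessor(train_rows):
    """Learn per-feature median, mean and std from the training rows only."""
    params = []
    for column in zip(*train_rows):
        observed = [v for v in column if v is not None]
        med = median(observed)
        filled = [med if v is None else v for v in column]
        std = pstdev(filled)
        # Guard against constant features to avoid division by zero.
        params.append((med, mean(filled), std if std > 0 else 1.0))
    return params


def transform(rows, params):
    """Impute missing values with the training median, then z-score."""
    return [
        [((med if v is None else v) - mu) / sd
         for v, (med, mu, sd) in zip(row, params)]
        for row in rows
    ]


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# Hypothetical toy data: two features, None marks a missing measurement.
X_train = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [None, 20.0]]
y_train = ["control", "control", "control", "case"]
X_test = [[None, 40.0]]

params = fit_preprocessor(X_train)          # fitted on training data only
X_train_scaled = transform(X_train, params)
X_test_scaled = transform(X_test, params)   # reuses training statistics
weights = balanced_class_weights(y_train)
```

Fitting the imputation and scaling parameters on the training split and reusing them on held-out samples keeps information from the test data out of the model, which is one common way to avoid overly optimistic performance estimates.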

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9bbb/9956296/8d915aeaff23/genes-14-00248-g001.jpg

Similar articles

Using machine learning approaches for multi-omics data analysis: A review.

Biotechnol Adv. 2021 Jul-Aug;49:107739. doi: 10.1016/j.biotechadv.2021.107739. Epub 2021 Mar 29.

Machine Learning and Integrative Analysis of Biomedical Big Data.

Genes (Basel). 2019 Jan 28;10(2):87. doi: 10.3390/genes10020087.

Min-redundancy and max-relevance multi-view feature selection for predicting ovarian cancer survival using multi-omics data.

BMC Med Genomics. 2018 Sep 14;11(Suppl 3):71. doi: 10.1186/s12920-018-0388-0.

Benchmarking feature selection and feature extraction methods to improve the performances of machine-learning algorithms for patient classification using metabolomics biomedical data.

Comput Struct Biotechnol J. 2024 Mar 19;23:1274-1287. doi: 10.1016/j.csbj.2024.03.016. eCollection 2024 Dec.

Network-based integration of multi-omics data for clinical outcome prediction in neuroblastoma.

Sci Rep. 2022 Sep 14;12(1):15425. doi: 10.1038/s41598-022-19019-5.

Multiomics and eXplainable artificial intelligence for decision support in insulin resistance early diagnosis: A pediatric population-based longitudinal study.

Artif Intell Med. 2024 Oct;156:102962. doi: 10.1016/j.artmed.2024.102962. Epub 2024 Aug 20.

A Machine Learning-Based Approach Using Multi-omics Data to Predict Metabolic Pathways.

Methods Mol Biol. 2023;2553:441-452. doi: 10.1007/978-1-0716-2617-7_19.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

Computerized decision support and machine learning applications for the prevention and treatment of childhood obesity: A systematic review of the literature.

Artif Intell Med. 2020 Apr;104:101844. doi: 10.1016/j.artmed.2020.101844. Epub 2020 Mar 19.

Cited by

How quantum computing can enhance biomarker discovery.

Patterns (N Y). 2025 Apr 29;6(6):101236. doi: 10.1016/j.patter.2025.101236. eCollection 2025 Jun 13.

Demystifying the black box: A survey on explainable artificial intelligence (XAI) in bioinformatics.

Comput Struct Biotechnol J. 2025 Jan 10;27:346-359. doi: 10.1016/j.csbj.2024.12.027. eCollection 2025.

Multimodal machine learning for analysing multifactorial causes of disease-The case of childhood overweight and obesity in Mexico.

Front Public Health. 2025 Jan 7;12:1369041. doi: 10.3389/fpubh.2024.1369041. eCollection 2024.

Multi-omics architecture of childhood obesity and metabolic dysfunction uncovers biological pathways and prenatal determinants.

Nat Commun. 2025 Jan 14;16(1):654. doi: 10.1038/s41467-025-56013-7.

Integrative multi-omics and systems bioinformatics in translational neuroscience: A data mining perspective.

J Pharm Anal. 2023 Aug;13(8):836-850. doi: 10.1016/j.jpha.2023.06.011. Epub 2023 Jun 30.

Special Issue: New Advances in Bioinformatics and Biomedical Engineering Using Machine Learning Techniques, IWBBIO-2022.

Genes (Basel). 2023 Aug 1;14(8):1574. doi: 10.3390/genes14081574.
