Omics data integration in computational biology viewed through the prism of machine learning paradigms.

Authors

Fouché Aziz, Zinovyev Andrei

Affiliations

Institut Curie, PSL Research University, Paris, France.

Institut National de la Santé et de la Recherche Médicale, Paris, France.

Publication information

Front Bioinform. 2023 Aug 4;3:1191961. doi: 10.3389/fbinf.2023.1191961. eCollection 2023.

DOI: 10.3389/fbinf.2023.1191961

PMID: 37600970

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10436311/

Abstract

Important quantities of biological data can today be acquired to characterize cell types and states, from various sources and using a wide diversity of methods, providing scientists with more and more information to answer challenging biological questions. Unfortunately, working with this amount of data comes at the price of ever-increasing data complexity. This is caused by the multiplication of data types and batch effects, which hinders the joint usage of all available data within common analyses. Data integration describes a set of tasks geared towards embedding several datasets of different origins or modalities into a joint representation that can then be used to carry out downstream analyses. In the last decade, dozens of methods have been proposed to tackle the different facets of the data integration problem, relying on various paradigms. This review introduces the most common data types encountered in computational biology and provides systematic definitions of the data integration problems. We then present how machine learning innovations were leveraged to build effective data integration algorithms, that are widely used today by computational biologists. We discuss the current state of data integration and important pitfalls to consider when working with data integration tools. We eventually detail a set of challenges the field will have to overcome in the coming years.

