Fouché Aziz, Zinovyev Andrei
Institut Curie, PSL Research University, Paris, France.
Institut National de la Santé et de la Recherche Médicale, Paris, France.
Front Bioinform. 2023 Aug 4;3:1191961. doi: 10.3389/fbinf.2023.1191961. eCollection 2023.
Large quantities of biological data can now be acquired to characterize cell types and states, from various sources and using a wide diversity of methods, giving scientists more and more information to answer challenging biological questions. Unfortunately, working with this amount of data comes at the price of ever-increasing data complexity. This complexity stems from the multiplication of data types and batch effects, which hinders the joint use of all available data within common analyses. Data integration describes a set of tasks geared towards embedding several datasets of different origins or modalities into a joint representation that can then be used to carry out downstream analyses. In the last decade, dozens of methods have been proposed to tackle the different facets of the data integration problem, relying on various paradigms. This review introduces the most common data types encountered in computational biology and provides systematic definitions of the data integration problems. We then present how machine learning innovations were leveraged to build effective data integration algorithms that are widely used today by computational biologists. We discuss the current state of data integration and important pitfalls to consider when working with data integration tools. Finally, we detail a set of challenges the field will have to overcome in the coming years.