Labory Justine, Njomgue-Fotso Evariste, Bottini Silvia
Université Côte d'Azur, Center of Modeling Simulation and Interactions, Nice, France.
INRAE, Université Côte d'Azur, CNRS, Institut Sophia Agrobiotech, Sophia-Antipolis, France.
Comput Struct Biotechnol J. 2024 Mar 19;23:1274-1287. doi: 10.1016/j.csbj.2024.03.016. eCollection 2024 Dec.
Classification tasks remain an open challenge in biomedicine. Although several machine-learning techniques exist for this purpose, peculiarities of biomedical data, especially omics measurements, can prevent their use or limit their performance. Omics approaches aim to understand a complex biological system through systematic analysis of its content at the molecular level. However, omics data are heterogeneous, sparse and affected by the classical "curse of dimensionality" problem, i.e. having far fewer samples (observations) than omics features. Furthermore, a major problem with multi-omics data is imbalance at either the class or the feature level. The objective of this work is to study whether feature extraction and/or feature selection techniques can improve the performance of machine-learning classification algorithms on omics measurements.
Among all omics, metabolomics has emerged as a powerful tool in cancer research, facilitating a deeper understanding of the complex metabolic landscape associated with tumorigenesis and tumor progression. We therefore selected three publicly available metabolomics datasets and applied several feature extraction techniques, both linear and non-linear, with and without feature selection methods. For each of the three datasets, we evaluated patient classification performance in every configuration.
We provide a general workflow and guidelines on when to use these techniques depending on the characteristics of the available data. To further test whether our approach extends to other omics data, we included a transcriptomics dataset and a proteomics dataset. Overall, for all datasets, we showed that applying supervised feature selection improves the performance of feature extraction methods for classification purposes. Scripts used to perform all analyses are available at: https://github.com/Plant-Net/Metabolomic_project/.
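The comparison described above (feature extraction with or without a preceding supervised feature selection step) can be illustrated with a minimal sketch. This is not the authors' code, which is available in the repository linked above. It assumes scikit-learn, a synthetic dataset from `make_classification` standing in for an omics matrix with far fewer samples than features, an ANOVA F-test filter (`SelectKBest`) as the supervised selector, PCA as a linear extraction method, and logistic regression as the classifier. The helper `build_pipeline` and all parameter values are hypothetical choices made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(use_selection, k=50, n_components=5):
    """Hypothetical helper: scaling, optional supervised selection, PCA, classifier."""
    steps = [("scale", StandardScaler())]
    if use_selection:
        # Supervised filter: keep the k features with the highest ANOVA F-score.
        steps.append(("select", SelectKBest(score_func=f_classif, k=k)))
    # Linear feature extraction on whatever features remain.
    steps.append(("extract", PCA(n_components=n_components)))
    steps.append(("clf", LogisticRegression(max_iter=1000)))
    return Pipeline(steps)


# Synthetic stand-in for an omics matrix: 60 samples, 500 features, imbalanced classes.
X, y = make_classification(
    n_samples=60,
    n_features=500,
    n_informative=10,
    n_redundant=0,
    weights=[0.7, 0.3],
    random_state=0,
)

# Stratified folds preserve the class ratio in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for name, use_selection in [("extraction only", False), ("selection + extraction", True)]:
    # Selection is fitted inside each training fold, so test labels never leak into it.
    scores[name] = cross_val_score(
        build_pipeline(use_selection), X, y, cv=cv, scoring="balanced_accuracy"
    )
    print(f"{name}: mean balanced accuracy = {scores[name].mean():.3f}")
```

Placing the selector inside the `Pipeline` matters: fitting a supervised filter on the full dataset before cross-validation would leak label information into the test folds and inflate the scores. Balanced accuracy is used here because the synthetic classes are imbalanced; the metric used in the original study may differ.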