Department of Biochemistry and Molecular Biology II, University of Granada, 18071 Granada, Spain.
"José Mataix Verdú" Institute of Nutrition and Food Technology (INYTA), Center of Biomedical Research, University of Granada, 18100 Granada, Spain.
Genes (Basel). 2023 Jan 18;14(2):248. doi: 10.3390/genes14020248.
The use of machine learning techniques to build predictive models of disease outcomes (based on omics and other types of molecular data) has gained enormous relevance in the biomedical field in the last few years. Nonetheless, the value of omics studies and machine learning tools depends on the proper application of algorithms as well as on the appropriate pre-processing and management of the input omics and molecular data. Currently, many of the available approaches that apply machine learning to omics data for predictive purposes make mistakes in several of the following key steps: experimental design, feature selection, data pre-processing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data, and we present a series of best practices and recommendations for each of these steps. In particular, we describe the main characteristics of each omics data layer, the most suitable pre-processing approaches for each source, and a compilation of best practices and tips for predicting disease development using machine learning. Using examples of real data, we show how to address the key problems of multi-omics research mentioned above (e.g., biological heterogeneity, technical noise, high dimensionality, presence of missing values, and class imbalance). Finally, based on the results found, we outline proposals for model improvement, which serve as the basis for future work.
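Two of the problems named above, missing values and class imbalance, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the toy data, the median-imputation choice, and the helper names (`stratified_split`, `fit_medians`, `impute`, `balanced_class_weights`) are assumptions made only for illustration. The point it shows is that pre-processing statistics should be estimated on the training samples only and then applied unchanged to the held-out samples, and that the minority class can be up-weighted rather than ignored.

```python
import random
import statistics
from collections import Counter


def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def fit_medians(rows):
    """Per-feature medians computed from observed (non-None) values only."""
    medians = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        medians.append(statistics.median(observed) if observed else 0.0)
    return medians


def impute(rows, medians):
    """Replace each missing value with the median learned for that feature."""
    return [[medians[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}


# Hypothetical toy data: 80 controls (0), 20 cases (1), 5 features,
# with roughly 10% of the values missing at random.
rng = random.Random(0)
labels = [0] * 80 + [1] * 20
X = [[None if rng.random() < 0.1 else rng.gauss(lab, 1.0) for _ in range(5)]
     for lab in labels]

train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=1)

# Medians are learned from the training rows only, then reused for the test
# rows, so no information from held-out samples leaks into pre-processing.
medians = fit_medians([X[i] for i in train_idx])
X_train = impute([X[i] for i in train_idx], medians)
X_test = impute([X[i] for i in test_idx], medians)

weights = balanced_class_weights([labels[i] for i in train_idx])
print(len(train_idx), len(test_idx), weights)
```

In practice the same idea is usually expressed with an established library; scikit-learn, for example, provides `Pipeline`, `SimpleImputer`, and the `class_weight="balanced"` option, which follow the same fit-on-training-data logic.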