A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction.

Author information

Pudjihartono Nicholas, Fadason Tayaza, Kempa-Liehr Andreas W, O'Sullivan Justin M

Affiliations

Liggins Institute, University of Auckland, Auckland, New Zealand.

Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand.

Publication information

Front Bioinform. 2022 Jun 27;2:927312. doi: 10.3389/fbinf.2022.927312. eCollection 2022.

Machine learning has proven useful for detecting patterns in large, unstructured, and complex datasets. One promising application is precision medicine, where disease risk is predicted from patient genetic data. However, building an accurate prediction model from genotype data remains challenging because of the so-called "curse of dimensionality" (i.e., the number of features far exceeds the number of samples). The generalizability of machine learning models therefore benefits from feature selection, which aims to retain only the most "informative" features and to remove noisy, "non-informative," irrelevant, and redundant ones. In this article, we provide a general overview of the different feature selection methods, their advantages, disadvantages, and use cases, focusing on the detection of relevant features (i.e., SNPs) for disease risk prediction.
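
To make the idea of feature selection concrete, the following is a minimal sketch of one simple approach, a univariate filter that ranks each SNP by its absolute Pearson correlation with the phenotype and keeps the top k. It is an illustration only and is not taken from the article: the function names (abs_correlation, select_top_k, make_toy_data), the 0/1/2 genotype coding, the binary phenotype, and the synthetic data are all assumptions made for this example.

```python
import random
from statistics import mean


def abs_correlation(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return abs(cov) / (var_x * var_y) ** 0.5


def select_top_k(genotypes, phenotype, k):
    """Return the sorted column indices of the k highest-scoring SNPs."""
    n_features = len(genotypes[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in genotypes]
        scores.append((abs_correlation(column, phenotype), j))
    scores.sort(reverse=True)  # highest score first
    return sorted(j for _, j in scores[:k])


def make_toy_data(n_samples=60, n_snps=500, causal=(3, 42), seed=0):
    """Hypothetical data set with far more SNPs than samples.

    Genotypes are coded 0/1/2; the phenotype depends only on the
    SNPs listed in `causal` (an assumption of this toy example).
    """
    rng = random.Random(seed)
    genotypes = [
        [rng.choice((0, 1, 2)) for _ in range(n_snps)]
        for _ in range(n_samples)
    ]
    phenotype = [
        1 if sum(row[j] for j in causal) >= 2 else 0 for row in genotypes
    ]
    return genotypes, phenotype


if __name__ == "__main__":
    X, y = make_toy_data()
    print("Selected SNP indices:", select_top_k(X, y, k=10))
```

In practice the selection step would need to be fitted on training data only (for example, inside each cross-validation fold); selecting features on the full data set before evaluation leaks information and inflates the apparent accuracy.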
