Research on eight machine learning algorithms applicability on different characteristics data sets in medical classification tasks.

Authors

Zhang Yiyan, Li Qin, Xin Yi

Affiliations

School of Intelligent Manufacturing, Qingdao Huanghai University, Qingdao, China.

School of Life Science, Beijing Institute of Technology, Beijing, China.

Publication information

Front Comput Neurosci. 2024 Jan 31;18:1345575. doi: 10.3389/fncom.2024.1345575. eCollection 2024.

DOI:10.3389/fncom.2024.1345575

PMID:38356726

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10864458/

Abstract

With the rapid development of the data mining field, more and more algorithms have been proposed or improved. Quickly selecting a data mining algorithm that suits a data set in the medical field is a challenge for some medical workers. This paper compares the characteristics of general medical data sets with those of general data sets from other fields, and derives applicability rules for choosing a data mining algorithm that suits the characteristics of the data set under study. The study quantified the characteristics of the research data sets with 26 indicators, including simple indicators, statistical indicators, and information-theoretic indicators. Eight machine learning algorithms with high maturity, low user involvement, and strong representativeness of their algorithm families were selected as the base algorithms. Algorithm performance was evaluated in three respects: prediction accuracy, running speed, and memory consumption. Decision tree and stepwise regression models were built to learn from this metadata, yielding algorithm applicability knowledge for medical data sets. Under cross-validation, the accuracy of every algorithm applicability prediction model was above 75%, demonstrating the validity and feasibility of the applicability knowledge.
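
To make the idea of quantifying a data set with simple, statistical, and information-theoretic indicators more concrete, here is a minimal Python sketch. The abstract does not list the 26 indicators, so the five computed here are only plausible examples of each group, and the function names (`class_entropy`, `skewness`, `describe_dataset`) and the toy data are hypothetical, not taken from the paper.

```python
import math
from collections import Counter
from statistics import fmean, pstdev


def class_entropy(labels):
    """Shannon entropy of the class distribution, in bits (information-theoretic)."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def skewness(values):
    """Population skewness (third standardized moment); 0.0 for a constant column."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return 0.0
    return fmean([((v - mu) / sigma) ** 3 for v in values])


def describe_dataset(rows, labels):
    """Return a small dictionary of illustrative data set characteristics.

    rows   -- list of equal-length lists of numeric attribute values
    labels -- list of class labels, one per row
    """
    columns = list(zip(*rows))  # transpose rows into attribute columns
    return {
        # simple indicators
        "n_instances": len(rows),
        "n_attributes": len(columns),
        "n_classes": len(set(labels)),
        # statistical indicator
        "mean_abs_skewness": fmean([abs(skewness(col)) for col in columns]),
        # information-theoretic indicator
        "class_entropy_bits": class_entropy(labels),
    }


# Toy data: four instances, two numeric attributes, two balanced classes.
rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 50.0]]
labels = [0, 0, 1, 1]
features = describe_dataset(rows, labels)
print(features)
```

In a meta-learning setup of the kind the abstract describes, one such feature vector would be computed per data set and paired with each algorithm's measured accuracy, running time, and memory use. A decision tree or regression model would then be trained on those pairs.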
