Clinical data classification using an enhanced SMOTE and chaotic evolutionary feature selection.

Authors

Sreejith S, Khanna Nehemiah H, Kannan A

Affiliations

Ramanujan Computing Centre, Anna University, Chennai, 600025, Tamil Nadu, India.

Publication information

Comput Biol Med. 2020 Nov;126:103991. doi: 10.1016/j.compbiomed.2020.103991. Epub 2020 Sep 18.

Abstract

Class imbalance and the presence of irrelevant or redundant features in training data can pose serious challenges to the development of a classification framework. This paper proposes a framework for developing a Clinical Decision Support System (CDSS) that addresses both class imbalance and the feature selection problem. Under this framework, the dataset is balanced at the data level and a wrapper approach is used to perform feature selection. Three clinical datasets from the University of California Irvine (UCI) machine learning repository were used for experimentation: the Indian Liver Patient Dataset (ILPD), the Thoracic Surgery Dataset (TSD) and the Pima Indian Diabetes (PID) dataset. The Synthetic Minority Over-sampling Technique (SMOTE), enhanced using Orchard's algorithm, was used to balance the datasets. A wrapper approach that uses Chaotic Multi-Verse Optimisation (CMVO) was proposed for feature subset selection. The fitness function was the arithmetic mean of the Matthews correlation coefficient (MCC) and the F-score (F1), both measured using a Random Forest (RF) classifier. After the relevant features were selected, an RF comprising 100 estimators and using the Information Gain Ratio as the split criterion was used for classification. The classifier achieved an MCC of 0.65, an F1 of 0.84 and 82.46% accuracy for the ILPD; an MCC of 0.74, an F1 of 0.87 and 86.88% accuracy for the TSD; and an MCC of 0.78, an F1 of 0.89 and 89.04% accuracy for the PID dataset. The effects of balancing and feature selection on the classifier were investigated, and the performance of the framework was compared with existing work in the literature. The results showed that the proposed framework is competitive in terms of the three performance measures used. The results of a Wilcoxon test confirmed the statistical superiority of the proposed method.
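The fitness function described in the abstract, the arithmetic mean of MCC and F1, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`confusion_counts`, `mcc`, `f1`, `fitness`), the binary label encoding with `1` as the positive class, and the convention of returning 0.0 when a denominator is zero are all assumptions made for illustration. In the paper the predictions come from a Random Forest trained on a candidate feature subset; here they are simply passed in as a list.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if undefined (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1(tp, fp, fn):
    """F1 score of the positive class; 0.0 if undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fitness(y_true, y_pred, positive=1):
    """Arithmetic mean of MCC and F1, as the abstract describes the fitness."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return (mcc(tp, fp, tn, fn) + f1(tp, fp, fn)) / 2


if __name__ == "__main__":
    # Hypothetical labels, used only to exercise the functions.
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 1]
    print(fitness(y_true, y_pred))
```

A wrapper method such as the one described would call a function like `fitness` once per candidate feature subset and keep the subset with the highest score. Because MCC ranges over [-1, 1] and F1 over [0, 1], the mean ranges over [-0.5, 1].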
