Zeng Daohui, Peng Jidong, Fong Simon, Qiu Yining, Wong Raymond
First Affiliated Hospital of Guangzhou University of TCM, Guangzhou, People's Republic of China.
Ganzhou People's Hospital, Jiangxi, People's Republic of China.
Australas Phys Eng Sci Med. 2018 Dec;41(4):1087-1100. doi: 10.1007/s13246-018-0674-3. Epub 2018 Sep 11.
In this paper, we propose a novel technique termed optimized swarm search-based feature selection (OS-FS), a swarm-type search function that selects an ideal subset of features to enhance classification accuracy. Sentiment prediction is becoming an increasingly crucial machine learning technique for gaining insights from unstructured medical texts, and its robustness and accuracy have recently made it popular in the medical industry. Medical text mining is a fundamental data analytic for sentiment prediction. A popular preprocessing step in text mining transforms medical text strings into word vectors, which together form a high-dimensional sparse matrix. However, such a sparse matrix poses problems for the induction of an accurate sentiment prediction model. Feature selection, which finds a subset of the original features in the sparse matrix, is a commonly used dimensionality reduction technique that can improve the accuracy of the prediction model. The swarm search in our proposed OS-FS can be optimized by a new feature evaluation technique called clustering-by-coefficient-of-variation. We apply this method to a case scenario of 279 medical articles related to 'meaningful use functionalities on health care quality, safety, and efficiency', drawn from a systematic review of previous medical IT literature. In this medical text mining task, four sentiment classes (positive, mixed-positive, neutral and negative) are recognized from the document contents. Our experimental results demonstrate the superiority of OS-FS over traditional feature selection methods in the literature.
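The two building blocks named in the abstract, the word-vector sparse matrix and the coefficient of variation, can be illustrated with a minimal sketch. This is not the authors' OS-FS implementation: the abstract does not specify how the clustering or the swarm search works, so neither is reproduced here. The toy documents, the whitespace tokenizer, and the function names `build_term_matrix`, `coefficient_of_variation` and `cv_per_feature` are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_term_matrix(docs):
    """Return (vocab, rows), where rows[i][j] counts term vocab[j] in document i."""
    # Assumption: lowercase + whitespace split stands in for real text preprocessing.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts.get(term, 0) for term in vocab])
    return vocab, rows


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 when the mean is 0)."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / mean


def cv_per_feature(vocab, rows):
    """Map each term to the coefficient of variation of its column of counts."""
    columns = zip(*rows)
    return {term: coefficient_of_variation(col) for term, col in zip(vocab, columns)}


# Hypothetical toy corpus, invented for this sketch (not from the 279 articles).
docs = [
    "ehr improved safety",
    "ehr reduced efficiency",
    "ehr improved quality and safety",
]

vocab, rows = build_term_matrix(docs)
cv = cv_per_feature(vocab, rows)

# Fraction of zero entries: even a tiny corpus yields a fairly sparse matrix.
sparsity = sum(v == 0 for row in rows for v in row) / (len(rows) * len(vocab))

for term in vocab:
    print(f"{term:12s} cv={cv[term]:.3f}")
print(f"sparsity={sparsity:.3f}")
```

In this toy corpus a term that occurs equally often in every document has a coefficient of variation of zero, while a term confined to one document scores highest. How OS-FS clusters features by this statistic and feeds the result into the swarm search is described in the full paper, not in this sketch.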