Department of Computer Science & Engineering, JIIT, Noida, India.
Advanced Knowledge Engineering Center, Global Biomedical Technologies, Inc., Roseville, CA, USA.
J Med Syst. 2018 Apr 13;42(5):97. doi: 10.1007/s10916-018-0941-6.
A machine learning (ML)-based text classification system comprises several classifiers. The performance evaluation (PE) of such a system is typically driven by the training data size and the partition protocols used. Such systems yield low accuracy because they cannot model the input text data in terms of its noise characteristics. This study proposes the concept of a misrepresentation ratio (MRR) for input healthcare text data and models the PE criteria to validate the hypothesis. Further, the proposed system provides a platform that combines several attributes of the ML system: data size, classifier type, partitioning protocol and percentage MRR. Our comprehensive data analysis covered five text data sets (TwitterA, WebKB4, Disease, Reuters (R8) and SMS); five classifiers (support vector machine with linear kernel (SVM-L), multi-layer perceptron (MLP)-based neural network, AdaBoost, stochastic gradient descent and decision tree); and five training protocols (K2, K4, K5, K10 and JK). With the data sets ordered by decreasing MRR, our ML system achieves mean classification accuracies of 70.13 ± 0.15%, 87.34 ± 0.06%, 93.73 ± 0.03%, 94.45 ± 0.03% and 97.83 ± 0.01%, respectively, across all classifiers and protocols. The corresponding area under the curve (AUC) is 0.98 for the SMS data using the MLP-based neural network. Among all the classifiers, the MLP-based neural network gives the best accuracy, 91.84 ± 0.04%, which is 6% better than previously published results. Further, we observed that as MRR decreases, system robustness increases, as validated by the standard deviations. The overall system accuracy across all data types, classifiers and protocols is 89%, showing the entire ML system to be novel, robust and unique. The system was also tested for stability and reliability.
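The partition protocols and the "mean ± standard deviation" reporting can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that K2, K4, K5 and K10 denote k-fold cross-validation with k = 2, 4, 5 and 10, and that JK (jackknife) corresponds to leave-one-out. The function names `kfold_indices`, `protocol_folds` and `summarize` are hypothetical, and the abstract does not state how the reported standard deviations were computed.

```python
from statistics import mean, pstdev


def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Samples are split in their given order; a real experiment would
    normally shuffle (and possibly stratify) before splitting.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def protocol_folds(name, n_samples):
    """Map a protocol label to a fold count (assumed interpretation)."""
    if name == "JK":
        # Assumption: jackknife is treated as leave-one-out.
        return n_samples
    return int(name[1:])  # e.g. "K10" -> 10


def summarize(accuracies):
    """Return (mean, population standard deviation) of per-fold accuracies."""
    return mean(accuracies), pstdev(accuracies)


# Show how the training share grows from K2 to JK on a toy set of 20 samples.
n = 20
for name in ["K2", "K4", "K5", "K10", "JK"]:
    k = protocol_folds(name, n)
    train, test = next(kfold_indices(n, k))
    print(f"{name}: {k} folds, {len(train)} training / {len(test)} test samples")
```

Under these assumptions the training share rises from 50% (K2) to (N-1)/N (JK), so the protocol can be read as a proxy for training data size. Passing the per-fold accuracies of any classifier to `summarize` gives a mean and spread of the kind quoted in the abstract.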