Mujtaba Ghulam, Shuib Liyana, Raj Ram Gopal, Rajandram Retnagowri, Shaikh Khairunisa, Al-Garadi Mohammed Ali
Department of Information Systems, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia.
Department of Computer Science, Sukkur Institute of Business Administration, Sukkur, Pakistan.
PLoS One. 2017 Feb 6;12(2):e0170242. doi: 10.1371/journal.pone.0170242. eCollection 2017.
Widespread implementation of electronic databases has improved the accessibility of plaintext clinical information for supplementary use. Numerous machine learning techniques, such as supervised machine learning approaches or ontology-based approaches, have been employed to obtain useful information from plaintext clinical data. This study proposes an automatic multi-class classification system to predict accident-related causes of death from plaintext autopsy reports through expert-driven feature selection with supervised automatic text classification decision models.
Accident-related autopsy reports were obtained from one of the largest hospitals in Kuala Lumpur. These reports covered nine different accident-related causes of death. A master feature vector was prepared by extracting unigram features with lexical categorization from the collected autopsy reports. This master feature vector was used to detect the cause of death, coded according to the International Classification of Diseases, 10th Revision (ICD-10), across five automated feature selection schemes, the proposed expert-driven approach, five feature subset sizes, and five machine learning classifiers. Model performance was evaluated using precisionM, recallM, F-measureM, accuracy, and area under the ROC curve. Four baselines were used for comparison with the proposed system.
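As a rough illustration of unigram features restricted to an expert-chosen term list, consider the minimal sketch below. It is not the study's implementation: the term list, the sample report text, and the function names are hypothetical, and the lexical categorization step and the automated feature selection schemes are omitted.

```python
import re
from collections import Counter

# Hypothetical expert-chosen terms; the study's actual feature list is not given here.
EXPERT_TERMS = ["fracture", "skull", "burn", "drowning", "haemorrhage"]


def unigrams(text):
    """Lowercase the report and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def feature_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the report."""
    counts = Counter(unigrams(text))
    return [counts[term] for term in vocabulary]


# Invented example text, not taken from any real autopsy report.
report = "Skull fracture with subdural haemorrhage; no burn injuries. Fracture of ribs."
vector = feature_vector(report, EXPERT_TERMS)
print(dict(zip(EXPERT_TERMS, vector)))
```

Each report becomes a fixed-length count vector over the selected terms, which is the form a supervised classifier such as a decision tree or random forest would consume.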
Random forest and J48 decision models parameterized using expert-driven feature selection yielded the highest scores, approaching 85% to 90% for most metrics, with a feature subset size of 30. The proposed system also improved overall accuracy by approximately 14% to 16% compared with existing techniques and the four baselines.
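For readers unfamiliar with the multi-class measures behind these percentages, the sketch below shows one common way to compute them. It assumes that the M subscript in precisionM, recallM, and F-measureM denotes macro-averaging over classes, and that F-measureM is the harmonic mean of the two macro averages; the abstract does not state either point, and the labels and predictions are invented for illustration.

```python
def macro_scores(y_true, y_pred):
    """Return (precision_M, recall_M, F_M, accuracy) for a multi-class task.

    Assumption: M means macro-averaging, i.e. per-class scores are averaged
    with equal weight regardless of class size.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision_m = sum(precisions) / len(labels)
    recall_m = sum(recalls) / len(labels)
    denom = precision_m + recall_m
    # Assumption: F_M is derived from the macro averages, not averaged per class.
    f_m = 2 * precision_m * recall_m / denom if denom else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return precision_m, recall_m, f_m, accuracy


# Invented labels for illustration only.
y_true = ["fall", "fall", "burn", "drowning"]
y_pred = ["fall", "burn", "burn", "drowning"]
precision_m, recall_m, f_m, accuracy = macro_scores(y_true, y_pred)
print(round(precision_m, 3), round(recall_m, 3), round(f_m, 3), accuracy)
```

Macro-averaging weights each cause of death equally, so rare classes influence the score as much as common ones; accuracy, by contrast, is dominated by the most frequent classes.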
The proposed system is feasible and practical for automatic classification of ICD-10 causes of death from autopsy reports. It can help pathologists determine the underlying cause of death accurately and rapidly on the basis of autopsy findings. Furthermore, the proposed expert-driven feature selection approach and the findings are generally applicable to other kinds of plaintext clinical reports.