National Institute of Informatics, Tokyo, Japan.
Int J Med Inform. 2009 Dec;78(12):e47-58. doi: 10.1016/j.ijmedinf.2009.03.010. Epub 2009 May 15.
This paper explores the benefits of using n-grams and semantic features for the classification of disease outbreak reports, in the context of the BioCaster disease outbreak report text mining system. A novel feature of this work is the use of a general-purpose semantic tagger, the USAS tagger, to generate features.
We outline the application context for this work (the BioCaster epidemiological text mining system), before going on to describe the experimental data used in our classification experiments (the 1000-document BioCaster corpus).
Three broad groups of features are used in this work: Named Entity based features, n-gram features, and features derived from the USAS semantic tagger.
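As an illustration of the n-gram feature group, the following is a minimal sketch of unigram, bigram and trigram extraction. It assumes lowercasing and whitespace tokenisation, which the abstract does not specify; the function names `ngrams` and `ngram_features` and the example sentence are hypothetical and are not taken from the BioCaster system.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=3):
    """Count every 1..max_n-gram in a text.

    Assumption: lowercasing plus whitespace splitting stands in for
    whatever tokenisation the original study used.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Invented example sentence, for illustration only.
features = ngram_features("Avian influenza outbreak reported in poultry")
```

Each key in `features` is a tuple of one, two or three tokens, so the three n-gram orders can share a single feature dictionary without colliding.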
Three standard machine learning algorithms (Naïve Bayes, the Support Vector Machine algorithm, and the C4.5 decision tree algorithm) were used for classifying the experimental data (that is, the BioCaster corpus). Feature selection was performed using the chi-squared (χ²) feature selection algorithm. Standard text classification performance metrics (Accuracy, Precision, Recall, Specificity and F-score) are reported.
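The feature selection statistic and the reported metrics can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the standard χ² statistic for a 2x2 term/class contingency table, it treats the F-score as the balanced F1 (the abstract does not give the weighting), and all counts are invented for illustration rather than drawn from the BioCaster corpus. The function names are hypothetical.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - c * b) ** 2 / denom


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute the five reported metrics from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        # Assumption: balanced F1, the harmonic mean of precision and recall.
        "f_score": safe_div(2 * precision * recall, precision + recall),
    }


# Invented counts, for illustration only.
score = chi_squared(40, 10, 10, 40)
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
```

In a selection step of this kind, features would be ranked by `score` and only the highest-scoring ones kept before training the classifier; the cut-off used in the study is not given in the abstract.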
A feature representation composed of unigrams, bigrams, trigrams and features derived from a semantic tagger, in conjunction with the Naïve Bayes algorithm and feature selection, yielded the highest classification accuracy (and F-score). This result was statistically significant compared to a baseline unigram representation and to previous work on the same task. However, it was feature selection rather than semantic tagging that contributed most to the improved performance.
This study has shown that for the classification of disease outbreak reports, a combination of bag-of-words, n-grams and semantic features, in conjunction with feature selection, increases classification accuracy at a statistically significant level compared to previous work in this domain.