Classifying disease outbreak reports using n-grams and semantic features.

Affiliation

National Institute of Informatics, Tokyo, Japan.

Publication information

Int J Med Inform. 2009 Dec;78(12):e47-58. doi: 10.1016/j.ijmedinf.2009.03.010. Epub 2009 May 15.

DOI:10.1016/j.ijmedinf.2009.03.010

PMID:19447070

Abstract

INTRODUCTION

This paper explores the benefits of using n-grams and semantic features for the classification of disease outbreak reports, in the context of the BioCaster disease outbreak report text mining system. A novel feature of this work is the use of a general-purpose semantic tagger, the USAS tagger, to generate features.

BACKGROUND

We outline the application context for this work (the BioCaster epidemiological text mining system), before going on to describe the experimental data used in our classification experiments (the 1000-document BioCaster corpus).

FEATURE SETS

Three broad groups of features are used in this work: named-entity-based features, n-gram features, and features derived from the USAS semantic tagger.
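As an illustration of the n-gram feature group, the following is a minimal sketch of how unigram, bigram and trigram counts might be extracted from a report. It is not the authors' implementation: the whitespace tokenizer, the function names `ngrams` and `ngram_features`, and the sample sentence are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-token tuples found in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=3):
    """Count every 1..max_n-gram in a lowercased, whitespace-tokenized text.

    Whitespace tokenization is an assumption made for brevity; a real
    system would likely use a proper tokenizer.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical sample sentence, not taken from the BioCaster corpus.
features = ngram_features("Avian influenza outbreak reported in the region")
```

Each key of `features` is a tuple of one, two or three tokens, so unigrams, bigrams and trigrams share a single feature space and can be passed together to a feature selection step.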

METHODOLOGY

Three standard machine learning algorithms (Naïve Bayes, the Support Vector Machine algorithm, and the C4.5 decision tree algorithm) were used for classifying the experimental data (that is, the BioCaster corpus). Feature selection was performed using the χ² (chi-squared) feature selection algorithm. Standard text classification performance metrics (Accuracy, Precision, Recall, Specificity and F-score) are reported.
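The chi-squared statistic used for feature selection can be sketched for a single feature and a binary class from a 2x2 contingency table. This is the standard textbook form of the statistic, shown under assumptions: the abstract does not give the authors' exact procedure, and the names `chi2_score` and `select_top_k` and the example counts are hypothetical.

```python
def chi2_score(a, b, c, d):
    """Chi-squared statistic for one feature and one binary class.

    a: documents in the class that contain the feature
    b: documents outside the class that contain the feature
    c: documents in the class that lack the feature
    d: documents outside the class that lack the feature
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column of the table is empty: the feature carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_top_k(table, k):
    """Return the k feature names with the highest chi-squared scores."""
    scored = {name: chi2_score(*cells) for name, cells in table.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Hypothetical counts for illustration only (20 documents, 10 per class).
counts = {
    "outbreak": (10, 0, 0, 10),  # appears only in relevant reports
    "the": (5, 5, 5, 5),         # evenly spread, so uninformative
    "virus": (8, 2, 2, 8),
}
selected = select_top_k(counts, 2)
```

Features whose presence is independent of the class score near zero and are dropped, which shrinks the feature space before a classifier is trained.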

RESULTS

A feature representation composed of unigrams, bigrams, trigrams and features derived from a semantic tagger, in conjunction with the Naïve Bayes algorithm and feature selection, yielded the highest classification accuracy (and F-score). This result was statistically significant compared to a baseline unigram representation and to previous work on the same task. However, it was feature selection rather than semantic tagging that contributed most to the improved performance.
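For readers unfamiliar with the reported metrics, the sketch below computes Accuracy, Precision, Recall, Specificity and F-score from the four cells of a binary confusion matrix, using their standard definitions. The function name `classification_metrics` and the counts are hypothetical; they are not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts.

    tp/fp: true/false positives, tn/fn: true/false negatives.
    Any ratio with an empty denominator is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F-score here is the balanced F1: harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": f_score,
    }


# Hypothetical confusion counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Accuracy alone can mislead when the classes are imbalanced, which is one reason Precision, Recall and F-score are usually reported alongside it.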

CONCLUSION

This study has shown that for the classification of disease outbreak reports, a combination of bag-of-words, n-grams and semantic features, in conjunction with feature selection, increases classification accuracy at a statistically significant level compared to previous work in this domain.

Similar articles

Classifying disease outbreak reports using n-grams and semantic features.

Int J Med Inform. 2009 Dec;78(12):e47-58. doi: 10.1016/j.ijmedinf.2009.03.010. Epub 2009 May 15.

Towards role-based filtering of disease outbreak reports.

J Biomed Inform. 2009 Oct;42(5):773-80. doi: 10.1016/j.jbi.2008.12.009. Epub 2008 Dec 31.

Comparison of character-level and part of speech features for name recognition in biomedical texts.

J Biomed Inform. 2004 Dec;37(6):423-35. doi: 10.1016/j.jbi.2004.08.008.

A methodology to enhance spatial understanding of disease outbreak events reported in news articles.

Int J Med Inform. 2010 Apr;79(4):284-96. doi: 10.1016/j.ijmedinf.2010.01.014. Epub 2010 Feb 13.

Implementation and evaluation of a negation tagger in a pipeline-based system for information extract from pathology reports.

Stud Health Technol Inform. 2004;107(Pt 1):663-7.

Bio-medical entity extraction using support vector machines.

Artif Intell Med. 2005 Feb;33(2):125-37. doi: 10.1016/j.artmed.2004.07.019.

Recognizing names in biomedical texts: a machine learning approach.

Bioinformatics. 2004 May 1;20(7):1178-90. doi: 10.1093/bioinformatics/bth060. Epub 2004 Feb 10.

Feature selection and classification model construction on type 2 diabetic patients' data.

Artif Intell Med. 2007 Nov;41(3):251-62. doi: 10.1016/j.artmed.2007.07.002. Epub 2007 Aug 17.

Performance analysis of a POS tagger applied to discharge summaries in Portuguese.

Stud Health Technol Inform. 2010;160(Pt 2):959-63.

A thousand words in a scene.

IEEE Trans Pattern Anal Mach Intell. 2007 Sep;29(9):1575-89. doi: 10.1109/TPAMI.2007.1155.

Cited by

Elaboration of a new framework for fine-grained epidemiological annotation.

Sci Data. 2022 Oct 26;9(1):655. doi: 10.1038/s41597-022-01743-2.

PADI-web 3.0: A new framework for extracting and disseminating fine-grained information from the news for animal disease surveillance.

One Health. 2021 Dec 3;13:100357. doi: 10.1016/j.onehlt.2021.100357. eCollection 2021 Dec.

Automated Classification of Online Sources for Infectious Disease Occurrences Using Machine-Learning-Based Natural Language Processing Approaches.

Int J Environ Res Public Health. 2020 Dec 17;17(24):9467. doi: 10.3390/ijerph17249467.

Defining facets of social distancing during the COVID-19 pandemic: Twitter analysis.

J Biomed Inform. 2020 Nov;111:103601. doi: 10.1016/j.jbi.2020.103601. Epub 2020 Oct 14.

Automatic Annotation of Narrative Radiology Reports.

Diagnostics (Basel). 2020 Apr 1;10(4):196. doi: 10.3390/diagnostics10040196.

Design Choices for Automated Disease Surveillance in the Social Web.

Online J Public Health Inform. 2018 Sep 21;10(2):e214. doi: 10.5210/ojphi.v10i2.9312. eCollection 2018.

Web monitoring of emerging animal infectious diseases integrated in the French Animal Health Epidemic Intelligence System.

PLoS One. 2018 Aug 3;13(8):e0199960. doi: 10.1371/journal.pone.0199960. eCollection 2018.

Classifying Chinese Questions Related to Health Care Posted by Consumers Via the Internet.

J Med Internet Res. 2017 Jun 20;19(6):e220. doi: 10.2196/jmir.7156.

Talking about Climate Change and Global Warming.

PLoS One. 2015 Sep 29;10(9):e0138996. doi: 10.1371/journal.pone.0138996. eCollection 2015.

Natural language processing methods for enhancing geographic metadata for phylogeography of zoonotic viruses.

AMIA Jt Summits Transl Sci Proc. 2014 Apr 7;2014:102-11. eCollection 2014.
