Extracting semantic lexicons from discharge summaries using machine learning and the C-Value method.

Authors

Jiang Min, Denny Josh C, Tang Buzhou, Cao Hongxin, Xu Hua

Affiliation

Department of Biomedical Informatics, Vanderbilt University, School of Medicine, Nashville, TN, USA.

Publication information

AMIA Annu Symp Proc. 2012;2012:409-16. Epub 2012 Nov 3.

Abstract

Semantic lexicons that link words and phrases to specific semantic types such as diseases are valuable assets for clinical natural language processing (NLP) systems. Although terms with predefined semantic types can be generated easily from existing knowledge bases such as the Unified Medical Language System (UMLS), they are often limited and do not have good coverage for narrative clinical text. In this study, we developed a method for building semantic lexicons from a clinical corpus. It extracts candidate semantic terms using a conditional random field (CRF) classifier and then selects terms using the C-Value algorithm. We applied the method to a corpus containing 10 years of discharge summaries from Vanderbilt University Hospital (VUH) and extracted 44,957 new terms for three semantic groups: Problem, Treatment, and Test. A manual analysis of 200 randomly selected terms not found in the UMLS demonstrated that 59% of them were meaningful new clinical concepts and 25% were lexical variants of existing concepts in the UMLS. Furthermore, we compared the effectiveness of corpus-derived and UMLS-derived semantic lexicons in the concept extraction task of the 2010 i2b2 clinical NLP challenge. Our results showed that the classifier with corpus-derived semantic lexicons as features achieved better performance (F-score 82.52%) than the classifier with UMLS-derived semantic lexicons as features (F-score 82.04%). We conclude that such corpus-based methods are effective for generating semantic lexicons, which may improve named entity recognition tasks and may aid in augmenting synonymy within existing terminologies.
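The abstract does not spell out how the C-Value step scores candidates, so the following is only a minimal sketch of the basic C-Value formulation commonly given in the term-extraction literature, not the authors' implementation. For a candidate a with |a| words and frequency f(a), the score is log2|a| · f(a) if a is not nested in any longer candidate; otherwise the mean frequency of the longer candidates containing a is subtracted from f(a) first. The function names `contains` and `c_value`, the example terms, and their frequencies are all hypothetical, and in the pipeline described above the candidates would come from the CRF classifier rather than a hand-written dictionary.

```python
import math


def contains(longer, shorter):
    """Return True if token tuple `shorter` occurs contiguously inside a strictly longer tuple."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq):
    """Score candidate terms (tuples of tokens) with the basic C-Value formula.

    `freq` maps each candidate term to its corpus frequency.
    """
    scores = {}
    for term, f in freq.items():
        # Frequencies of longer candidates in which this term is nested.
        nested_in = [fb for other, fb in freq.items() if contains(other, term)]
        if nested_in:
            # Discount occurrences that are explained by the longer terms.
            f = f - sum(nested_in) / len(nested_in)
        # Note: log2(1) == 0, so single-word terms score 0 under this basic form.
        scores[term] = math.log2(len(term)) * f
    return scores


# Illustrative (made-up) candidate frequencies.
candidates = {
    ("chest", "pain"): 10,
    ("severe", "chest", "pain"): 4,
    ("atypical", "chest", "pain"): 2,
}
scores = c_value(candidates)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

In this toy input, "chest pain" is nested in two longer candidates, so its frequency is discounted by their mean frequency before weighting by term length. Because the basic form assigns zero to single-word terms, variants that use log2(|a| + 1) are sometimes preferred; which variant the study used is not stated in the abstract.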
