Jiang Min, Denny Josh C, Tang Buzhou, Cao Hongxin, Xu Hua
Department of Biomedical Informatics, Vanderbilt University, School of Medicine, Nashville, TN, USA.
AMIA Annu Symp Proc. 2012;2012:409-16. Epub 2012 Nov 3.
Semantic lexicons that link words and phrases to specific semantic types such as diseases are valuable assets for clinical natural language processing (NLP) systems. Although terms with predefined semantic types can be generated easily from existing knowledge bases such as the Unified Medical Language System (UMLS), such term lists are often limited and do not have good coverage of narrative clinical text. In this study, we developed a method for building semantic lexicons from a clinical corpus. It extracts candidate semantic terms using a conditional random field (CRF) classifier and then selects terms using the C-Value algorithm. We applied the method to a corpus containing 10 years of discharge summaries from Vanderbilt University Hospital (VUH) and extracted 44,957 new terms for three semantic groups: Problem, Treatment, and Test. A manual analysis of 200 randomly selected terms not found in the UMLS demonstrated that 59% of them were meaningful new clinical concepts and 25% were lexical variants of existing concepts in the UMLS. Furthermore, we compared the effectiveness of corpus-derived and UMLS-derived semantic lexicons in the concept extraction task of the 2010 i2b2 clinical NLP challenge. Our results showed that the classifier with corpus-derived semantic lexicons as features achieved better performance (F-score 82.52%) than the classifier with UMLS-derived semantic lexicons as features (F-score 82.04%). We conclude that such corpus-based methods are effective for generating semantic lexicons, which may improve named entity recognition tasks and may aid in augmenting synonymy within existing terminologies.
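The term-selection step named in the abstract can be illustrated with a minimal sketch of C-value scoring. This is not the authors' implementation: the function names `contains` and `c_value`, the toy frequencies, and the use of log2(length + 1) rather than log2(length) (so that single-word terms are not scored zero) are all assumptions made for illustration. The idea is that a candidate's frequency is discounted by the average frequency of the longer candidates that contain it, then weighted by its length.

```python
import math
from collections import Counter


def contains(longer, shorter):
    """True if token tuple `shorter` occurs contiguously inside the longer tuple."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(term_freqs):
    """Score candidate terms with a simplified C-value.

    term_freqs maps a term (tuple of tokens) to its corpus frequency.
    """
    scores = {}
    for term, freq in term_freqs.items():
        # Frequencies of longer candidates in which this term is nested.
        nesting = [f for other, f in term_freqs.items() if contains(other, term)]
        # Discount by the mean frequency of the nesting terms, if any.
        adjusted = freq - sum(nesting) / len(nesting) if nesting else freq
        # Assumed variant: log2(len + 1) keeps single-word terms above zero.
        scores[term] = math.log2(len(term) + 1) * adjusted
    return scores


# Hypothetical candidate terms and counts, for illustration only.
candidates = Counter({
    ("chest", "pain"): 10,
    ("acute", "chest", "pain"): 4,
    ("pain",): 20,
})
scores = c_value(candidates)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

In a pipeline like the one described, the candidate terms would come from the CRF classifier's output over the corpus, and a threshold on the score would decide which terms enter the lexicon; the abstract does not state the threshold used.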