Shivade Chaitanya, Malewadkar Pranav, Fosler-Lussier Eric, Lai Albert M
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA.
J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S103-S110. doi: 10.1016/j.jbi.2015.08.025. Epub 2015 Sep 12.
The second track of the 2014 i2b2 challenge asked participants to automatically identify risk factors for heart disease in diabetic patients by applying natural language processing techniques to clinical notes. This paper describes a rule-based system built from a combination of regular expressions, concepts from the Unified Medical Language System (UMLS), and freely available community resources. With a performance (F1=90.7) significantly higher than the median (F1=87.20) and close to that of the top-performing system (F1=92.8), it was the best rule-based system among all challenge submissions. We also used this system to evaluate the utility of the different terminologies in the UMLS for the challenge task. Of the 155 terminologies in the UMLS, 129 (76.78%) have no representation in the corpus. The Consumer Health Vocabulary had very good coverage of relevant concepts and was the most useful terminology for the challenge task. While segmenting notes into sections and lists has a significant impact on performance, identifying negations and the experiencer of a medical event yields negligible gain.
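The abstract names three ingredients: regular-expression rules, segmentation of notes into sections, and negation detection. The sketch below illustrates how such pieces might fit together. It is not the authors' system: the section-header pattern, the risk-factor patterns, the negation cue list, and the decision to skip a "Family History" section are all hypothetical choices made for illustration, and no UMLS lookup is attempted.

```python
import re

# Hypothetical header pattern: a capitalized phrase followed by a colon on its own line.
SECTION_HEADER = re.compile(r"^(?P<name>[A-Z][A-Za-z ]+):[ \t]*$", re.MULTILINE)

# Hypothetical risk-factor rules; a real system would use far richer lexicons.
RISK_PATTERNS = {
    "hypertension": re.compile(r"\b(hypertension|htn|high blood pressure)\b", re.I),
    "hyperlipidemia": re.compile(r"\b(hyperlipidemia|high cholesterol)\b", re.I),
    "smoker": re.compile(r"\b(smoker|smokes|tobacco use)\b", re.I),
}

# Assumption: mentions in these sections do not describe the patient.
SKIP_SECTIONS = {"family history"}

# Assumption: a small list of negation cues that precede the mention.
NEGATION = re.compile(r"\b(no|denies|without|negative for)\b", re.I)


def split_sections(note):
    """Return (section_name, text) pairs; text before the first header is 'preamble'."""
    sections = []
    last_name, last_end = "preamble", 0
    for m in SECTION_HEADER.finditer(note):
        sections.append((last_name, note[last_end:m.start()]))
        last_name, last_end = m.group("name").strip().lower(), m.end()
    sections.append((last_name, note[last_end:]))
    return sections


def find_risk_factors(note, use_negation=True):
    """Return the set of risk-factor labels mentioned outside skipped sections."""
    found = set()
    for name, text in split_sections(note):
        if name in SKIP_SECTIONS:
            continue
        for label, pattern in RISK_PATTERNS.items():
            for m in pattern.finditer(text):
                prefix = text[:m.start()]
                # Only look back to the start of the current sentence or line.
                start = max(prefix.rfind("."), prefix.rfind(";"), prefix.rfind("\n")) + 1
                if use_negation and NEGATION.search(prefix[start:]):
                    continue
                found.add(label)
    return found


# Made-up example note, not taken from the challenge corpus.
example_note = """History of Present Illness:
Patient has hypertension. Denies tobacco use.
Family History:
Mother had high cholesterol.
"""
```

On the made-up note above, `find_risk_factors(example_note)` returns only `hypertension`: the tobacco mention is negated and the cholesterol mention sits in a skipped section. Passing `use_negation=False` adds `smoker`, which is one simple way to measure how much a single component contributes, in the spirit of the ablation the abstract reports.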