Doan Son, Xu Hua
Department of Biomedical Informatics, School of Medicine, Vanderbilt University.
Proc Int Conf Comput Ling. 2010 Aug;2010:259-266.
Because annotated data sets are scarce, few studies have examined machine learning-based approaches to extracting named entities (NEs) from clinical text. The 2009 i2b2 NLP challenge was a task to extract six types of medication-related NEs from hospital discharge summaries: medication names, dosage, mode, frequency, duration, and reason. Several machine learning-based systems were developed for the challenge and performed well. These systems often involve two steps: 1) recognition of medication-related entities; and 2) determination of the relation between a medication name and its modifiers (e.g., dosage). A few machine learning algorithms, including Conditional Random Fields (CRF) and Maximum Entropy, have been applied to the Named Entity Recognition (NER) task in the first step. In this study, we developed a Support Vector Machine (SVM)-based method to recognize medication-related entities. In addition, we systematically investigated various types of features for NER in clinical text. Evaluation on 268 manually annotated discharge summaries from the i2b2 challenge showed that the SVM-based NER system achieved its best F-score of 90.05% (93.20% precision, 87.12% recall) when semantic features generated from a rule-based system were included.
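As a rough illustration of the first step (entity recognition treated as token classification), the sketch below shows one way the inputs to such a classifier might be prepared. It is not the authors' implementation. The BIO label scheme, the one-token context window, the feature names, the example sentence, and the semantic tags (standing in for the output of a rule-based tagger) are all assumptions made for illustration. The `f_score` helper assumes the reported F-score is the balanced harmonic mean of precision and recall.

```python
from typing import Dict, List, Tuple

# Hypothetical entity span: (start_token, end_token_exclusive, entity_type).
Span = Tuple[int, int, str]


def bio_labels(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert entity spans into per-token BIO labels."""
    labels = ["O"] * n_tokens
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels


def token_features(tokens: List[str], sem_tags: List[str],
                   i: int, window: int = 1) -> Dict[str, str]:
    """Word and semantic-tag features for token i and its neighbours."""
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset}]"] = tokens[j].lower() if inside else "<pad>"
        feats[f"sem[{offset}]"] = sem_tags[j] if inside else "<pad>"
    return feats


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example sentence; the semantic tags are assumed, not real system output.
tokens = ["Lasix", "40", "mg", "po", "daily"]
sem_tags = ["DRUG", "NUM", "UNIT", "ROUTE", "FREQ"]
spans = [(0, 1, "medication"), (1, 3, "dosage"),
         (3, 4, "mode"), (4, 5, "frequency")]

labels = bio_labels(len(tokens), spans)
features = [token_features(tokens, sem_tags, i) for i in range(len(tokens))]
```

Each feature dictionary would then be vectorized and passed, together with its label, to a multi-class classifier such as an SVM. Applied to the reported precision and recall, `f_score(93.20, 87.12)` gives roughly 90.06, which agrees with the reported 90.05% up to rounding of the inputs.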