Lin Wutao, Ji Donghong, Lu Yanan
School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China.
School of Computer, Wuhan University, Wuhan, 430072, China.
BMC Bioinformatics. 2017 Jan 31;18(1):75. doi: 10.1186/s12859-017-1476-4.
Information extraction from clinical texts helps medical workers identify patients' problems faster and may make intelligent diagnosis possible in the future. Much work has addressed disorder mention recognition in clinical narratives, but recognizing more complicated disorder mentions, such as overlapping ones, remains an open issue. This paper proposes a method for disorder mention recognition based on a multi-label structured Support Vector Machine (SVM). We present a multi-label scheme that can be used in complicated entity recognition tasks.
We performed three sets of experiments to evaluate our model. Our best F-score on the 2013 Conference and Labs of the Evaluation Forum (CLEF) data set is 0.7343. Our multi-label scheme has six types of labels, each represented by a 24-bit binary number. The binary digits of each label carry information about different disorder mentions. The method can therefore recognize not only disorder mentions made up of contiguous or discontiguous words but also mentions whose spans overlap with each other. The experiments indicate that our multi-label structured SVM model outperforms the conditional random field (CRF) model on this disorder mention recognition task, and that our multi-label scheme surpasses the baseline BIOHD1234 scheme. The gain is largest for overlapping disorder mentions, where the F-score of our multi-label scheme is 0.1428 higher than that of the baseline.
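To make the idea of a bit-packed per-token label concrete, here is a minimal Python sketch. It is not the paper's scheme: the abstract does not give the bit layout, so the split into six 4-bit slots, the tag values `B` and `I`, the helper names `encode` and `decode`, and the example sentence are all assumptions chosen for illustration. The only point it demonstrates is that when each mention writes into its own group of bits, one token can belong to several mentions at once, which a single-tag-per-token scheme cannot express.

```python
# Hypothetical layout (an assumption, not taken from the paper):
# a 24-bit label is split into 6 slots of 4 bits, one slot per mention.
SLOT_BITS = 4
NUM_SLOTS = 6
SLOT_MASK = (1 << SLOT_BITS) - 1
B, I = 1, 2  # hypothetical tag values: first token / later token of a mention


def encode(num_tokens, mentions):
    """Pack mentions (lists of token indices) into one integer label per token."""
    if len(mentions) > NUM_SLOTS:
        raise ValueError("more mentions than available slots")
    labels = [0] * num_tokens
    for slot, mention in enumerate(mentions):
        for pos, tok in enumerate(sorted(mention)):
            tag = B if pos == 0 else I
            labels[tok] |= tag << (slot * SLOT_BITS)
    return labels


def decode(labels):
    """Recover the mentions by reading each slot independently."""
    mentions = []
    for slot in range(NUM_SLOTS):
        current = None
        for tok, label in enumerate(labels):
            tag = (label >> (slot * SLOT_BITS)) & SLOT_MASK
            if tag == B:
                if current:
                    mentions.append(current)
                current = [tok]
            elif tag == I and current is not None:
                current.append(tok)
        if current:
            mentions.append(current)
    return mentions


# Invented example: two discontiguous mentions that share "left atrium".
tokens = ["left", "atrium", "is", "dilated", "and", "thickened"]
gold = [[0, 1, 3], [0, 1, 5]]
labels = encode(len(tokens), gold)
recovered = decode(labels)
```

In this sketch the tokens "left" and "atrium" carry non-zero bits in two slots, so both overlapping mentions survive the round trip through `encode` and `decode`. A sequence model such as a structured SVM or a CRF would then predict these integer labels token by token.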
The multi-label structured SVM approach is shown to work well on this disorder recognition task. The novel multi-label scheme we present is superior to the baseline and can also be used with other models to solve various types of complicated entity recognition tasks.