LIMSI-CNRS, Orsay Cedex, France.
J Am Med Inform Assoc. 2011 Sep-Oct;18(5):588-93. doi: 10.1136/amiajnl-2011-000154. Epub 2011 May 19.
This paper describes the approaches the authors developed while participating in the i2b2/VA 2010 challenge to automatically extract medical concepts and annotate assertions on concepts and relations between concepts.
The authors' approaches rely on both rule-based and machine-learning methods. Natural language processing is used to extract features from the input texts; these features are then used in the authors' machine-learning approaches. The authors used Conditional Random Fields for concept extraction, and Support Vector Machines for assertion and relation annotation. Depending on the task, the authors tested various combinations of rule-based and machine-learning methods.
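As an illustration of how a rule-based component and a machine-learning component might be combined for assertion annotation, the following is a minimal sketch and not the authors' system. The cue lists, the function names (`rule_based_assertion`, `hybrid_assertion`, `stub_classifier`), and the rules-first ordering are all assumptions made for the example; the abstract does not specify which combinations were used.

```python
import re

# Hypothetical cue lists for illustration only; not the authors' actual rules.
NEGATION_CUES = re.compile(r"\b(no|not|denies|without|negative for)\b", re.IGNORECASE)
UNCERTAINTY_CUES = re.compile(r"\b(possible|possibly|probable|suspected|may)\b", re.IGNORECASE)


def rule_based_assertion(sentence, concept):
    """Return a label if a cue appears before the concept, otherwise None."""
    idx = sentence.lower().find(concept.lower())
    if idx < 0:
        return None  # concept not found in the sentence: no rule applies
    left_context = sentence[:idx]
    if NEGATION_CUES.search(left_context):
        return "absent"
    if UNCERTAINTY_CUES.search(left_context):
        return "possible"
    return None


def hybrid_assertion(sentence, concept, ml_classifier):
    """Apply the rules first; fall back to the statistical classifier (assumed ordering)."""
    label = rule_based_assertion(sentence, concept)
    if label is not None:
        return label
    return ml_classifier(sentence, concept)


def stub_classifier(sentence, concept):
    """Stand-in for a trained classifier (e.g. an SVM); always predicts 'present'."""
    return "present"


if __name__ == "__main__":
    print(hybrid_assertion("The patient denies chest pain.", "chest pain", stub_classifier))
    print(hybrid_assertion("Patient has diabetes.", "diabetes", stub_classifier))
```

In a real system, `stub_classifier` would be replaced by a model trained on the annotated corpus, and the rules could equally be used as features for that model rather than as an override.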
The authors' assertion annotation system obtained an F-measure of 0.931, ranking fifth out of 21 participants at the i2b2/VA 2010 challenge. The authors' relation annotation system ranked third out of 16 participants with a 0.709 F-measure. The 0.773 F-measure the authors obtained on concept extraction did not place in the top 10.
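For readers unfamiliar with the metric, the F-measure is the weighted harmonic mean of precision and recall. The sketch below computes it from raw counts, assuming the balanced variant (beta = 1); the abstract does not state which variant or averaging scheme the challenge used, and the counts in the usage line are made up for illustration.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta) from true-positive,
    false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0  # avoid division by zero
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


if __name__ == "__main__":
    # Hypothetical counts, not taken from the challenge results.
    print(f_measure(tp=80, fp=20, fn=20))
```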
On the one hand, the authors confirm that purely machine-learning methods depend heavily on the annotated training data and therefore perform better on well-represented classes. On the other hand, a purely rule-based method was not sufficient to handle new types of data. Finally, hybrid approaches combining machine-learning and rule-based methods yielded higher scores.