Mihăilă Claudiu, Ananiadou Sophia
The National Centre for Text Mining, School of Computer Science, The University of Manchester, 131 Princess Street, Manchester M1 7DN, United Kingdom.
J Bioinform Comput Biol. 2013 Dec;11(6):1343008. doi: 10.1142/S0219720013430087. Epub 2013 Dec 2.
Domain-specific information extraction systems are an important resource for biomedical researchers, who need to process vast amounts of knowledge in a short time. Automatic recognition of discourse causality can further reduce their workload by suggesting possible causal connections and aiding the curation of pathway models. We describe a machine learning approach to the automatic identification of discourse causality triggers in the biomedical domain. We create several baselines and compare various parameter settings for three algorithms: Conditional Random Fields (CRF), Support Vector Machines (SVM) and Random Forests (RF). We also evaluate the impact of lexical, syntactic, and semantic features on each algorithm, showing that semantic features improve performance in all cases. We test our comprehensive feature set on two corpora containing gold standard annotations of causal relations, and demonstrate the need for more gold standard data. The best performance, an F-score of 79.35%, is achieved by CRFs when using all three feature types.
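As a rough illustration of the task setup described in the abstract, the following minimal Python sketch frames trigger identification as token-level labelling and scores a simple lexicon baseline with the F-score. It is not the authors' implementation: the trigger lexicon, the feature names, and the example sentence are hypothetical, and only lexical features are shown (the syntactic and semantic features mentioned in the abstract would need a parser and domain resources).

```python
from typing import Dict, List, Set

# Hypothetical trigger lexicon, for illustration only (not from the paper).
TRIGGER_LEXICON: Set[str] = {"because", "therefore", "thus", "hence"}


def lexical_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build simple lexical features for the token at position i.

    A dictionary like this could be fed to a CRF, SVM, or Random Forest
    after vectorisation; the feature names here are assumptions.
    """
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def lexicon_baseline(tokens: List[str]) -> List[bool]:
    """Label a token as a trigger if it appears in the lexicon."""
    return [tok.lower() in TRIGGER_LEXICON for tok in tokens]


def f_score(gold: List[bool], pred: List[bool]) -> float:
    """Token-level F1: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentence and gold labels.
tokens = ["Thus", "the", "protein", "is", "degraded"]
gold = [True, False, False, False, True]
pred = lexicon_baseline(tokens)
score = f_score(gold, pred)
```

In this toy case the baseline finds one of the two gold triggers and produces no false positives, so precision is 1.0 and recall is 0.5. A learned model would replace `lexicon_baseline` with a classifier trained on feature dictionaries such as those returned by `lexical_features`.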