Campos David, Bui Quoc-Chinh, Matos Sérgio, Oliveira José Luís
IEETA/DETI, University of Aveiro, 3810-193, Aveiro, Portugal.
Department of Medical Informatics, Erasmus Medical Centre Rotterdam, Rotterdam, Netherlands.
Source Code Biol Med. 2014 Jan 8;9(1):1. doi: 10.1186/1751-0473-9-1.
Cellular events play a central role in the understanding of biological processes and functions, providing insight into both physiological and pathogenic mechanisms. Automatic extraction of mentions of such events from the literature is an important contribution to progress in the biomedical domain, allowing existing knowledge to be updated faster. Identifying the trigger words that indicate an event is a critical step in the event extraction pipeline, since the subsequent tasks rely on its output. This step presents several complex and unsolved challenges: selecting informative features, representing the textual context, and selecting a specific event type for a trigger word given that context.
We propose TrigNER, a machine learning-based solution for biomedical event trigger recognition. It uses Conditional Random Fields (CRFs) with a rich feature set that includes linguistic, orthographic, morphological, local context and dependency parsing features. Additionally, a fully configurable algorithm automatically optimizes the feature set and training parameters for each event type: it selects the features that make a positive contribution and tunes the CRF model order, n-gram sizes, vertex information and maximum hops for dependency parsing features. The final output consists of several CRF models, each optimized for the linguistic characteristics of one event type.
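The idea of keeping only the features with a positive contribution can be illustrated with a minimal sketch. It is not TrigNER's actual optimization algorithm, which the abstract does not detail: it assumes a simple greedy forward pass, and the names `select_features`, `toy_evaluate` and the feature group labels are hypothetical. In a real setting, the evaluation callback would train a CRF for one event type with the given features and return its score on held-out data.

```python
from typing import Callable, FrozenSet, Iterable, List


def select_features(
    candidates: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Greedy forward selection: keep a feature only if it improves the score.

    `evaluate` is a hypothetical callback that scores a feature set, for
    example by training a model with those features and measuring F-measure.
    """
    selected: List[str] = []
    best = evaluate(frozenset())
    for name in candidates:
        score = evaluate(frozenset(selected + [name]))
        if score > best:  # positive contribution: keep the feature
            selected.append(name)
            best = score
    return selected


# Toy stand-in for model training: each feature group adds or removes a fixed
# number of score points. These values are illustrative, not measured results.
TOY_GAINS = {"lemma": 30, "pos": 5, "capitalization": -2, "dependency": 10}


def toy_evaluate(features: FrozenSet[str]) -> float:
    return float(sum(TOY_GAINS[f] for f in features))


chosen = select_features(TOY_GAINS.keys(), toy_evaluate)
print(chosen)
```

Running the same loop once per event type, each time with its own evaluation callback, would yield one feature set per type, which mirrors the per-event-type models described above.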
TrigNER was tested on the BioNLP 2009 shared task corpus, achieving a total F-measure of 62.7 and outperforming existing solutions on several event trigger types, namely gene expression, transcription, protein catabolism, phosphorylation and binding. The proposed solution allows researchers to easily apply complex and optimized techniques to the recognition of biomedical event triggers, making their application a simple routine task. We believe this work is an important contribution to the biomedical text mining community, enabling improved and faster event recognition in scientific articles, and consequently hypothesis generation and knowledge discovery. This solution is freely available as open source at http://bioinformatics.ua.pt/trigner.
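For readers unfamiliar with the metric, the sketch below shows how an F-measure is commonly computed from counts of predicted and annotated triggers. It assumes the balanced F1 variant (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name and the counts are hypothetical and are not results from the paper.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-measure (F1) from trigger-level counts.

    Returns 0.0 when there are no true positives, so the function never
    divides by zero.
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 6 correct triggers, 2 spurious, 4 missed.
score = f_measure(true_pos=6, false_pos=2, false_neg=4)
print(f"{100 * score:.1f}")
```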