Department of Computer Science and Software Engineering, Concordia University, 1455 de Maisonneuve Blvd, West, Montréal, Canada.
BMC Bioinformatics. 2012 Jun 26;13 Suppl 11(Suppl 11):S7. doi: 10.1186/1471-2105-13-S11-S7.
In recent years, biological event extraction has emerged as a key natural language processing task, aiming to address the information overload problem in accessing the molecular biology literature. The BioNLP shared task competitions have contributed considerably to this recent interest. The first competition (BioNLP'09) focused on extracting biological events from Medline abstracts in a narrow domain, while the theme of the latest competition (BioNLP-ST'11) was generalization, and a wider range of text types, event types, and subject domains was considered. We view event extraction as a building block in larger discourse interpretation and propose a two-phase, linguistically grounded, rule-based methodology. In the first phase, a general, underspecified semantic interpretation is composed from syntactic dependency relations in a bottom-up manner. The notion of embedding underpins this phase, which is informed by a trigger dictionary and argument identification rules. Coreference resolution is also performed at this step, allowing extraction of inter-sentential relations. The second phase is concerned with constraining the resulting semantic interpretation by shared task specifications. We evaluated our general methodology on core biological event extraction and speculation/negation tasks in three main tracks of BioNLP-ST'11 (GENIA, EPI, and ID).
We achieved competitive results in the GENIA and ID tracks, while our results in the EPI track leave room for improvement. One notable feature of our system is that its performance is stable across abstracts and article bodies. Coreference resolution yields a minor improvement in system performance. Due to our interest in discourse-level elements, such as speculation/negation and coreference, we provide a more detailed analysis of our system performance in these subtasks.
The results demonstrate the viability of a robust, linguistically-oriented methodology, which clearly distinguishes general semantic interpretation from shared task specific aspects, for biological event extraction. Our error analysis pinpoints some shortcomings, which we plan to address in future work within our incremental system development methodology.
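The two-phase methodology summarized above can be illustrated with a minimal Python sketch. This is not the authors' system: the trigger dictionary, the argument identification rules, the task specification, the dependency labels, and the function names `compose` and `constrain` are all hypothetical, chosen only to show how a general, embedding-based interpretation might be composed bottom-up from dependency relations and then constrained by task-specific requirements. Coreference resolution and speculation/negation are left out.

```python
from dataclasses import dataclass, field

# Hypothetical trigger dictionary: lemma -> semantic type (an assumption).
TRIGGERS = {"inhibit": "Negative_regulation", "phosphorylation": "Phosphorylation"}

# Hypothetical argument identification rules: dependency label -> role.
ARG_RULES = {"nsubj": "Cause", "dobj": "Theme", "prep_of": "Theme"}

# Hypothetical task specification: event type -> roles the task accepts.
TASK_SPEC = {"Negative_regulation": {"Theme", "Cause"}, "Phosphorylation": {"Theme"}}


@dataclass
class Predication:
    trigger: str
    sem_type: str
    args: dict = field(default_factory=dict)  # role -> entity string or Predication


def compose(tokens, deps):
    """Phase 1: compose an underspecified interpretation from dependencies.

    tokens: {index: lemma}; deps: list of (head_index, dependent_index, label).
    """
    children = {}
    for head, dep, label in deps:
        children.setdefault(head, []).append((dep, label))

    def build(index):
        lemma = tokens[index]
        if lemma not in TRIGGERS:
            return lemma  # non-trigger tokens are treated as plain entity mentions
        pred = Predication(lemma, TRIGGERS[lemma])
        for dep, label in children.get(index, []):
            role = ARG_RULES.get(label)
            if role is not None:
                # The dependent is fully built before it is attached, so a
                # predication can embed another predication as its argument.
                pred.args[role] = build(dep)
        return pred

    dependents = {dep for _, dep, _ in deps}
    roots = [i for i in tokens if i not in dependents]
    built = [build(i) for i in roots]
    return [p for p in built if isinstance(p, Predication)]


def constrain(pred):
    """Phase 2: keep only what the (hypothetical) task specification allows."""
    if pred.sem_type not in TASK_SPEC:
        return None
    args = {}
    for role, value in pred.args.items():
        if role not in TASK_SPEC[pred.sem_type]:
            continue
        if isinstance(value, Predication):
            value = constrain(value)
            if value is None:
                continue
        args[role] = value
    if "Theme" not in args:  # assumption: every event needs a Theme
        return None
    return (pred.sem_type, pred.trigger, args)


# Toy input for "ProteinA inhibits phosphorylation of ProteinB" (placeholder names).
tokens = {1: "ProteinA", 2: "inhibit", 3: "phosphorylation", 4: "ProteinB"}
deps = [(2, 1, "nsubj"), (2, 3, "dobj"), (3, 4, "prep_of")]

interpretation = compose(tokens, deps)
events = [e for e in (constrain(p) for p in interpretation) if e is not None]
print(events)
```

Keeping the trigger dictionary and argument rules separate from the task specification mirrors the separation the abstract describes: only `TASK_SPEC` and `constrain` would need to change to target a different track.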