Nguyen Quang Long, Tikk Domonkos, Leser Ulf
Knowledge Management in Bioinformatics, Department for Computer Science, Humboldt-Universität zu Berlin, Germany.
J Biomed Semantics. 2010 Sep 24;1(1):9. doi: 10.1186/2041-1480-1-9.
Pattern-based approaches to relation extraction have shown very good results in many areas of biomedical text mining. However, defining the right set of patterns is difficult; approaches are either manual, incurring high cost, or automatic, often resulting in large sets of noisy patterns.
We propose several techniques for filtering sets of automatically generated patterns and analyze their effectiveness for different extraction tasks, as defined in the recent BioNLP 2009 shared task. We focus on simple methods that take into account only the complexity of the pattern and the complexity of the texts to which the patterns are applied. We show that our techniques, despite their simplicity, yield large improvements in all tasks we analyzed. For instance, they raise the F-score for the task of extracting gene expression events from 24.8% to 51.9%.
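As a rough illustration of complexity-based filtering, the following minimal sketch discards patterns and sentences whose size falls outside fixed bounds and computes an F-score from match counts. It is not the authors' implementation: the `Pattern` class, the use of token count as the complexity measure, and all threshold parameters are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Pattern:
    # Hypothetical representation: a pattern is a sequence of tokens.
    tokens: Tuple[str, ...]


def filter_patterns(patterns: List[Pattern], min_len: int, max_len: int) -> List[Pattern]:
    """Keep patterns whose token count lies in [min_len, max_len].

    Token count is an assumed stand-in for "pattern complexity".
    """
    return [p for p in patterns if min_len <= len(p.tokens) <= max_len]


def filter_sentences(sentences: List[str], max_tokens: int) -> List[str]:
    """Keep sentences with at most max_tokens whitespace-separated tokens.

    Sentence length is an assumed stand-in for "text complexity".
    """
    return [s for s in sentences if len(s.split()) <= max_tokens]


def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F-score (harmonic mean of precision and recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


patterns = [
    Pattern(("expression",)),
    Pattern(("expression", "of", "GENE")),
    Pattern(("GENE", "is", "strongly", "expressed", "in", "cells")),
]
kept = filter_patterns(patterns, min_len=2, max_len=5)
```

Removing patterns before matching also reduces the number of candidate matches that have to be checked, which is where a speed-up would come from in a setup like this.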
Even very simple filtering techniques can significantly improve the F-score of an information extraction method based on automatically generated patterns. Furthermore, applying such methods yields a considerable speed-up, as fewer matches need to be analysed. Because of their simplicity, the proposed filtering techniques should also be applicable to other methods that use linguistic patterns for information extraction.