Pyysalo Sampo, Ohta Tomoko, Tsujii Jun'ichi
Department of Computer Science, University of Tokyo, Tokyo, Japan.
J Biomed Semantics. 2011 Oct 6;2 Suppl 5(Suppl 5):S5. doi: 10.1186/2041-1480-2-S5-S5.
Event extraction following the GENIA Event corpus and BioNLP shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed literature database, far beyond the narrow subdomains of biomedicine for which annotated resources for extraction method development are available.
In the present study, our aim is to estimate the coverage of all statements of gene/protein associations in PubMed that existing resources for event extraction can provide. We base our analysis on a recently released corpus that covers all of PubMed and is automatically annotated for gene/protein entities and syntactic analyses, and we use named entity co-occurrence, shortest dependency paths and an unlexicalized classifier to identify likely statements of gene/protein associations. A set of high-frequency/high-likelihood association statements is then manually analyzed with reference to the GENIA ontology.
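As an illustration of the shortest dependency path step only, the following is a minimal sketch and not the authors' implementation. It assumes a sentence's dependency parse is available as a list of (head, dependent) token pairs and that the two co-occurring gene/protein mentions have already been identified; the function name `shortest_dependency_path`, the placeholder tokens `GENE1` and `GENE2`, and the toy parse are all hypothetical.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the token path from source to target, or None if unconnected.

    Dependency edges are treated as undirected, which is an assumption of
    this sketch rather than a detail stated in the abstract.
    """
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    if source not in graph or target not in graph:
        return None

    # Breadth-first search finds a shortest path in an unweighted graph.
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical parse of the toy sentence "GENE1 binds to GENE2".
edges = [("binds", "GENE1"), ("binds", "to"), ("to", "GENE2")]
path = shortest_dependency_path(edges, "GENE1", "GENE2")
print(path)
```

In a pipeline of the kind described, the words on such a path (here "binds" and "to") would be the candidate expression of an association between the two entities. How paths are normalized, counted and passed to the classifier is not specified in this abstract.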
We present a first estimate of the overall coverage of gene/protein associations provided by existing resources for event extraction. Our results suggest that for event-type associations this coverage may be over 90%. We also identify several biologically significant associations of genes and proteins that are not addressed by these resources, suggesting directions for further extension of extraction coverage.