An analysis of gene/protein associations at PubMed scale.

Author information

Pyysalo Sampo, Ohta Tomoko, Tsujii Jun'ichi

Affiliation

Department of Computer Science, University of Tokyo, Tokyo, Japan.

Publication information

J Biomed Semantics. 2011 Oct 6;2 Suppl 5(Suppl 5):S5. doi: 10.1186/2041-1480-2-S5-S5.

DOI:10.1186/2041-1480-2-S5-S5

PMID:22166173

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3239305/

Abstract

BACKGROUND

Event extraction following the GENIA Event corpus and BioNLP shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed literature database, far beyond the narrow subdomains of biomedicine for which annotated resources for extraction method development are available.

RESULTS

In the present study, our aim is to estimate the coverage of all statements of gene/protein associations in PubMed that existing resources for event extraction can provide. We base our analysis on a recently released corpus automatically annotated for gene/protein entities and syntactic analyses covering the entire PubMed, and use named entity co-occurrence, shortest dependency paths and an unlexicalized classifier to identify likely statements of gene/protein associations. A set of high-frequency/high-likelihood association statements are then manually analyzed with reference to the GENIA ontology.
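To illustrate one of the signals named above, the shortest dependency path between two co-occurring gene/protein mentions can be sketched as a breadth-first search over the sentence's dependency graph treated as undirected. This is a minimal sketch, not the authors' implementation: the function name `shortest_dependency_path`, the toy sentence, the placeholder entities `PROT1` and `PROT2`, and the hand-written edges are all assumptions made for illustration.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the shortest path of token indices from source to target.

    edges is a list of (head, dependent) token-index pairs. The graph is
    treated as undirected. Returns None if no path exists.
    """
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    if source not in graph or target not in graph:
        return None

    previous = {source: None}  # back-pointers for path reconstruction
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical toy sentence with hand-written (head, dependent) edges.
tokens = ["PROT1", "binds", "to", "PROT2", "in", "vitro"]
edges = [(1, 0), (1, 2), (2, 3), (1, 4), (4, 5)]

path = shortest_dependency_path(edges, 0, 3)
print([tokens[i] for i in path])
```

In this toy case the path connecting the two entity mentions passes through "binds" and "to", which is the kind of connecting structure that could then be inspected or passed to a classifier.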

CONCLUSIONS

We present a first estimate of the overall coverage of gene/protein associations provided by existing resources for event extraction. Our results suggest that for event-type associations this coverage may be over 90%. We also identify several biologically significant associations of genes and proteins that are not addressed by these resources, suggesting directions for further extension of extraction coverage.
