Santos Carlos, Eggle Daniela, States David J
Bioinformatics Program, The University of Michigan, Ann Arbor, MI 48109, USA.
Bioinformatics. 2005 Apr 15;21(8):1653-8. doi: 10.1093/bioinformatics/bti165. Epub 2004 Nov 25.
Wnt signaling is a very active area of research, with highly relevant publications appearing at a rate of more than one per day. Building and maintaining databases that describe signal transduction networks is a time-consuming and demanding task that requires careful literature analysis and extensive domain-specific knowledge. For instance, more than 50 factors involved in Wnt signal transduction had been identified as of late 2003. In this work we describe a natural language processing (NLP) system that identifies references to biological interaction networks in free text and automatically assembles a protein association and interaction map.
A 'gold standard' set of names and assertions, comprising 53 interactions involved in Wnt signaling, was derived by manually scanning the Wnt genes website (http://www.stanford.edu/~rnusse/wntwindow.html). The system was used to analyze a corpus of peer-reviewed articles related to Wnt signaling, including 3369 PubMed and 1230 full-text papers. Names for key Wnt-pathway-associated proteins and biological entities were identified using a chi-squared analysis of noun phrases over-represented in the Wnt literature compared with the general signal transduction literature. Interestingly, we identified several instances where generic terms were used on the website although more specific terms occur in the literature, as well as one typographical error in the Wnt canonical pathway. Using the named entity list and performing an exhaustive assertion extraction over the corpus, 34 of the 53 interactions in the 'gold standard' Wnt signaling set were successfully identified (64% recall). In addition, the automated extraction found several interactions involving key Wnt-related molecules that were missing from, or different from those in, the canonical diagram, and these were confirmed by manual review of the text. These results suggest that a combination of NLP techniques for information extraction can form a useful first-pass tool for assisting human annotation and maintenance of signal pathway databases.
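The chi-squared over-representation test and the recall figure described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy phrase counts, and the 3.84 cutoff (the chi-squared critical value for p = 0.05 with one degree of freedom) are assumptions chosen for illustration, since the abstract does not state how the statistic was thresholded.

```python
from collections import Counter


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]].

    a: occurrences of the phrase in the target (e.g. Wnt) corpus
    b: occurrences of all other phrases in the target corpus
    c: occurrences of the phrase in the background corpus
    d: occurrences of all other phrases in the background corpus
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def overrepresented_phrases(target_counts, background_counts, threshold=3.84):
    """Return (phrase, score) pairs over-represented in the target corpus.

    The threshold is a hypothetical choice, not taken from the paper.
    """
    target_total = sum(target_counts.values())
    background_total = sum(background_counts.values())
    scored = []
    for phrase, a in target_counts.items():
        c = background_counts.get(phrase, 0)
        # Skip phrases that are not relatively more frequent in the target.
        if a * background_total <= c * target_total:
            continue
        score = chi_squared_2x2(a, target_total - a, c, background_total - c)
        if score >= threshold:
            scored.append((phrase, score))
    return sorted(scored, key=lambda item: -item[1])


def recall(extracted, gold):
    """Fraction of gold-standard items that were recovered."""
    gold = set(gold)
    return len(set(extracted) & gold) / len(gold)


# Toy noun-phrase counts (invented for illustration only).
wnt_counts = Counter({"beta-catenin": 40, "protein": 60})
background_counts = Counter({"beta-catenin": 5, "protein": 95})
candidates = overrepresented_phrases(wnt_counts, background_counts)

# Recall as reported in the abstract: 34 of 53 gold-standard interactions.
reported_recall = recall(range(34), range(53))
```

With the toy counts, only "beta-catenin" passes the filter, because "protein" is relatively more frequent in the background corpus. The recall helper reproduces the reported figure, since 34/53 rounds to 64%.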
The pipeline software components are freely available from the authors on request.
http://stateslab.bioinformatics.med.umich.edu/software.html.