Verspoor Karin, Mackinlay Andrew, Cohn Judith D, Wall Michael E
National ICT Australia, Victoria Research Lab, Parkville, VIC 3010, Australia.
Pac Symp Biocomput. 2013:433-44.
This paper explores the application of text mining to the problem of detecting protein functional sites in the biomedical literature, and specifically considers the task of identifying catalytic sites in that literature. Through an analysis of two corpora in terms of their coverage of curated data sources, we provide strong evidence of the need for text mining techniques that address residue-level protein function annotation. We also explore the viability of building a text-based classifier for identifying protein functional sites, and find that the low coverage of curated data sources and the potential ambiguity of information about protein functional sites are challenges that must be addressed. Nevertheless, in a first attempt at this classification task, we produce a simple classifier that achieves a reasonable F-score of ∼69% on our full-text silver corpus. The work has application in computational prediction of the functional significance of protein sites, as well as in curation workflows for databases that capture this information.