Gaizauskas R, Demetriou G, Artymiuk P J, Willett P
Department of Computer Science, University of Sheffield, Western Bank, UK.
Bioinformatics. 2003 Jan;19(1):135-43. doi: 10.1093/bioinformatics/19.1.135.
The rapid increase in the volume of protein structure literature means that useful information may be hidden or lost in the published literature, and the process of finding relevant material, sometimes the rate-determining factor in new research, may be arduous and slow.
We describe the Protein Active Site Template Acquisition (PASTA) system, which addresses these problems by automatically extracting information about the roles of specific amino acid residues in protein molecules from online scientific articles and abstracts. Both the terminology recognition and extraction capabilities of the system have been extensively evaluated against manually annotated data, and the results compare favourably with state-of-the-art results obtained in less challenging domains. PASTA is the first information extraction (IE) system developed for the protein structure domain and one of the most thoroughly evaluated IE systems operating on biological scientific text to date.
PASTA makes its extraction results available via a browser-based front end: http://www.dcs.shef.ac.uk/nlp/pasta/. The evaluation resources (manually annotated corpora) are also available through the website: http://www.dcs.shef.ac.uk/nlp/pasta/results.html.