iProLINK: an integrated protein resource for literature mining.

Author information

Hu Zhang-Zhi, Mani Inderjeet, Hermoso Vincent, Liu Hongfang, Wu Cathy H

Affiliation

Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20057, USA.

Publication information

Comput Biol Chem. 2004 Dec;28(5-6):409-16. doi: 10.1016/j.compbiolchem.2004.09.010.

Abstract

The exponential growth of large-scale molecular sequence data and of the PubMed scientific literature has prompted active research in biological literature mining and information extraction to facilitate genome/proteome annotation and improve the quality of biological databases. Motivated by the promise of text mining methodologies, and by the lack of adequate curated data for training and benchmarking them, the Protein Information Resource (PIR) has developed iProLINK (integrated Protein Literature INformation and Knowledge), a resource for protein literature mining. As PIR focuses its effort on the curation of the UniProt protein sequence database, the goal of iProLINK is to provide curated data sources that can be utilized for text mining research in the areas of bibliography mapping, annotation extraction, protein named entity recognition, and protein ontology development. The data sources for bibliography mapping and annotation extraction include mapped citations (PubMed ID to protein entry and feature line mapping) and annotation-tagged literature corpora. The latter include several hundred abstracts and full-text articles tagged with experimentally validated post-translational modifications (PTMs) annotated in the PIR protein sequence database. The data sources for entity recognition and ontology development include a protein name dictionary, word token dictionaries, protein name-tagged literature corpora along with tagging guidelines, and a protein ontology based on PIRSF protein family names. iProLINK is freely accessible at http://pir.georgetown.edu/iprolink, with hypertext links to all downloadable files.

