Krallinger Martin, Malik Rainer, Valencia Alfonso
Dep. Struct. Comp. Biology, Spanish National Cancer Centre (CNIO), Melchor Fernández Almagro, 3, E-28029 Madrid, Spain.
Genome Inform. 2006;17(2):121-30.
Existing biological knowledge stored as structured database records has been extracted manually by database curators analyzing the scientific literature. Most of this information was derived from sentences which describe biologically relevant aspects of genes and gene products. We introduce the Protein description sentence (Prodisen) corpus, a useful resource for the automatic identification and construction of text-based protein and gene description records using information extraction and text classification techniques. Basic guidelines and criteria relevant for the construction of a text corpus of functional descriptions of genes and proteins are proposed. The steps used for the corpus construction and its features are presented. Moreover, some of the potential applications of the Prodisen corpus for biomedical text mining purposes are explored and the obtained results are presented.