GENIA corpus--semantically annotated corpus for bio-textmining.

Author information

Kim J-D, Ohta T, Tateisi Y, Tsujii J

Affiliation

CREST, Japan Science and Technology Corporation, Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan.

Publication information

Bioinformatics. 2003;19 Suppl 1:i180-2. doi: 10.1093/bioinformatics/btg1023.

DOI:10.1093/bioinformatics/btg1023

PMID:12855455

Abstract

MOTIVATION

Natural language processing (NLP) methods are regarded as useful for raising the potential of text mining from the biological literature. The lack of an extensively annotated corpus of this literature, however, is a major bottleneck for applying NLP techniques. The GENIA corpus is being developed to provide reference material that lets NLP techniques work for bio-textmining.

RESULTS

GENIA corpus version 3.0, consisting of 2000 MEDLINE abstracts, has been released with more than 400,000 words and almost 100,000 annotations for biological terms.

Abstract (translation)