Oh So-Yeon, Kim Ji-Hyeon, Kim Seo-Jin, Nam Hee-Jo, Park Hyun-Seok
Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea.
Center for Convergence Research of Advanced Technologies, Ewha Womans University, Seoul 03760, Korea.
Genomics Inform. 2018 Sep;16(3):75-77. doi: 10.5808/GI.2018.16.3.75. Epub 2018 Sep 30.
Genomics & Informatics (NLM title abbreviation: Genomics Inform) is the official journal of the Korea Genome Organization. A text corpus of this journal, annotated with various levels of linguistic information, would be a valuable resource because information extraction requires syntactic, semantic, and higher-level natural language processing. In this study, we publish a new corpus, GNI Corpus version 1.0, extracted and annotated from the full texts of Genomics & Informatics using an NLTK (Natural Language Toolkit)-based text mining script. This preliminary version of the corpus could serve as a training and testing set for systems that support a variety of future biomedical text mining functions.
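As a rough illustration of the kind of processing described above, the following sketch splits text into sentences, tokenizes them, and divides the annotated records into training and testing sets. It is not the authors' script: it uses only the Python standard library as a stand-in for NLTK, and the sample text, the function names (`split_sentences`, `tokenize`, `annotate`, `train_test_split`), the record layout, and the 80/20 split are all assumptions made for illustration.

```python
import random
import re

# Hypothetical sample text; the actual corpus is built from journal full texts.
SAMPLE = (
    "Genomics & Informatics is the official journal of the Korea Genome Organization. "
    "A text corpus supports information extraction. "
    "Annotation adds linguistic information to each sentence."
)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def annotate(text):
    """Build one record per sentence with its token layer (assumed layout)."""
    return [
        {"id": i, "text": s, "tokens": tokenize(s)}
        for i, s in enumerate(split_sentences(text))
    ]


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle deterministically and hold out a fraction (at least one record)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


records = annotate(SAMPLE)
train, test = train_test_split(records)
```

A real pipeline would replace the regular expressions with a trained sentence splitter and tokenizer and add further layers such as part-of-speech tags; the fixed seed simply keeps the split reproducible.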