Dai Hong-Jie, Singh Onkar, Jonnagaddala Jitendra, Su Emily Chia-Yu
Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan; Interdisciplinary Program of Green and Information Technology, National Taitung University, Taitung, Taiwan.
Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan.
Database (Oxford). 2016 Jul 27;2016. doi: 10.1093/database/baw111. Print 2016.
In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of the names of genes and their products has made the literature harder to interpret for readers and for curators of molecular interaction databases. To address this challenge, we applied normalization, a technique that links variants of a biological object's name to a single, standardized form. In this work, we developed a species normalization module, which recognizes species names and normalizes them to NCBI Taxonomy IDs. Unlike most previous work, which ignored the prefix of a gene name that abbreviates the name of the species to which the gene belongs, our module includes these prefixed species in its recognition results. The species normalization module achieved an overall F-score of 0.954 on an instance-level species normalization corpus. For gene normalization, two separate modules were employed: one recognizes gene mentions, and the other normalizes those mentions to their Entrez Gene IDs using a multistage normalization algorithm developed for processing full-text articles. All of the developed modules are BioC-compatible .NET framework libraries and are publicly available from the NuGet gallery. Database URL: https://sites.google.com/site/hjdairesearch/Projects/isn-corpus.
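The idea of normalizing both full species names and species prefixes of gene names can be illustrated with a minimal Python sketch. It is not the authors' .NET module: the lexicon, the prefix table (`h`, `m`, `At`, `Dm`), the regular expressions, and the function name `normalize_species` are all assumptions made for illustration. Only the NCBI Taxonomy IDs themselves (for example 9606 for Homo sapiens) are real identifiers.

```python
import re

# Illustrative lexicon (an assumption, not the module's dictionary):
# lowercase species name -> NCBI Taxonomy ID.
SPECIES_LEXICON = {
    "homo sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "mouse": 10090,
    "arabidopsis thaliana": 3702,
    "drosophila melanogaster": 7227,
}

# Hypothetical table of species prefixes found on gene names,
# e.g. the "h" in "hZIP2" or the "At" in "AtMYB12".
SPECIES_PREFIXES = {
    "h": 9606,
    "m": 10090,
    "At": 3702,
    "Dm": 7227,
}

# Longest names first so that multi-word names win over shorter ones.
_NAME_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(n) for n in sorted(SPECIES_LEXICON, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

# A prefix counts only when it starts a token and is followed by an
# uppercase letter plus more alphanumerics (a crude gene-symbol shape).
_PREFIX_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(p) for p in sorted(SPECIES_PREFIXES, key=len, reverse=True))
    + r")(?=[A-Z][A-Za-z0-9]+\b)"
)


def normalize_species(text):
    """Return (start, end, surface, taxonomy_id) tuples sorted by offset."""
    mentions = []
    for m in _NAME_PATTERN.finditer(text):
        mentions.append((m.start(), m.end(), m.group(1), SPECIES_LEXICON[m.group(1).lower()]))
    for m in _PREFIX_PATTERN.finditer(text):
        mentions.append((m.start(), m.end(), m.group(1), SPECIES_PREFIXES[m.group(1)]))
    return sorted(mentions)


if __name__ == "__main__":
    sentence = (
        "Expression of hZIP2 in human cells differs from "
        "AtMYB12 in Arabidopsis thaliana."
    )
    for start, end, surface, taxid in normalize_species(sentence):
        print(start, end, surface, taxid)
```

A purely pattern-based prefix rule like this one over-generates: a symbol such as mTOR would be tagged as mouse even though its leading "m" is not a species abbreviation. A practical system would need additional context or filtering, which this sketch does not attempt.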