IEETA/DETI, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal.
BMC Bioinformatics. 2013 Sep 24;14:281. doi: 10.1186/1471-2105-14-281.
Concept recognition is an essential task in biomedical information extraction, and it presents several complex, unsolved challenges. Solutions are typically developed in an ad hoc manner or with general-purpose information extraction frameworks, which are not optimized for the biomedical domain and normally require integrating complex external libraries, developing custom tools, or both.
This article presents Neji, an open source framework optimized for biomedical concept recognition and built around four key characteristics: modularity, scalability, speed, and usability. It integrates modules for biomedical natural language processing, such as sentence splitting, tokenization, lemmatization, part-of-speech tagging, chunking, and dependency parsing. Concept recognition is provided through dictionary matching and through machine learning with normalization methods. Neji also integrates an innovative concept tree implementation that supports overlapping concept names and the corresponding disambiguation techniques. The most popular input and output formats, namely PubMed XML, IeXML, CoNLL, and A1, are also supported. On top of the built-in functionality, developers and researchers can implement new processing modules or pipelines, or use the provided command-line interface tool to build their own solutions, applying the most appropriate techniques to identify heterogeneous biomedical concepts. Neji was evaluated against three gold standard corpora with heterogeneous biomedical concepts (CRAFT, AnEM, and the NCBI disease corpus), achieving high performance on named entity recognition (F1-measure for overlap matching: species 95%, cell 92%, cellular components 83%, genes and proteins 76%, chemicals 65%, biological processes and molecular functions 63%, disorders 85%, and anatomical entities 82%) and on entity normalization (F1-measure for overlap name matching with the correct identifier included in the returned list of identifiers: species 88%, cell 71%, cellular components 72%, genes and proteins 64%, chemicals 53%, and biological processes and molecular functions 40%). Neji provides fast, multi-threaded data processing, annotating up to 1200 sentences/second when using dictionary-based concept identification.
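The "overlap matching" criterion behind the F1 figures above can be illustrated with a minimal sketch. It assumes one common definition: a predicted span counts as correct if it shares at least one character with any gold span, and a gold span counts as found if any prediction overlaps it. The abstract does not specify the exact scoring protocol, so this is not necessarily the evaluation script used for Neji; the names `overlaps` and `overlap_f1` and the example offsets are hypothetical.

```python
from typing import List, Tuple

# A span is (start, end) in character offsets, with end exclusive (assumed convention).
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def overlap_f1(gold: List[Span], predicted: List[Span]) -> Tuple[float, float, float]:
    """Compute (precision, recall, F1) under relaxed overlap matching."""
    # Predictions that touch at least one gold span count toward precision.
    matched_pred = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
    # Gold spans touched by at least one prediction count toward recall.
    matched_gold = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))

    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical offsets: two of four predictions overlap gold spans,
# and two of three gold spans are found.
gold_spans = [(0, 5), (10, 20), (30, 40)]
predicted_spans = [(0, 5), (12, 18), (50, 55), (60, 62)]
precision, recall, f1 = overlap_f1(gold_spans, predicted_spans)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under this relaxed criterion the prediction `(12, 18)` is credited against the gold span `(10, 20)` even though the boundaries differ, which is why overlap-matching scores are generally at least as high as exact-matching scores on the same output.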
Considering the provided features and underlying characteristics, we believe that Neji is an important contribution to the biomedical community, streamlining the development of complex concept recognition solutions. Neji is freely available at http://bioinformatics.ua.pt/neji.