A modular framework for biomedical concept recognition.

Affiliation

IEETA/DETI, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal.

Publication information

BMC Bioinformatics. 2013 Sep 24;14:281. doi: 10.1186/1471-2105-14-281.

DOI:10.1186/1471-2105-14-281
PMID:24063607
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3849280/
Abstract

BACKGROUND

Concept recognition is an essential task in biomedical information extraction, presenting several complex and unsolved challenges. The development of such solutions is typically performed in an ad-hoc manner or using general information extraction frameworks, which are not optimized for the biomedical domain and normally require the integration of complex external libraries and/or the development of custom tools.

RESULTS

This article presents Neji, an open source framework optimized for biomedical concept recognition built around four key characteristics: modularity, scalability, speed, and usability. It integrates modules for biomedical natural language processing, such as sentence splitting, tokenization, lemmatization, part-of-speech tagging, chunking and dependency parsing. Concept recognition is provided through dictionary matching and machine learning with normalization methods. Neji also integrates an innovative concept tree implementation, supporting overlapped concept names and respective disambiguation techniques. The most popular input and output formats, namely Pubmed XML, IeXML, CoNLL and A1, are also supported. On top of the built-in functionalities, developers and researchers can implement new processing modules or pipelines, or use the provided command-line interface tool to build their own solutions, applying the most appropriate techniques to identify heterogeneous biomedical concepts. Neji was evaluated against three gold standard corpora with heterogeneous biomedical concepts (CRAFT, AnEM and NCBI disease corpus), achieving high performance results on named entity recognition (F1-measure for overlap matching: species 95%, cell 92%, cellular components 83%, gene and proteins 76%, chemicals 65%, biological processes and molecular functions 63%, disorders 85%, and anatomical entities 82%) and on entity normalization (F1-measure for overlap name matching and correct identifier included in the returned list of identifiers: species 88%, cell 71%, cellular components 72%, gene and proteins 64%, chemicals 53%, and biological processes and molecular functions 40%). Neji provides fast and multi-threaded data processing, annotating up to 1200 sentences/second when using dictionary-based concept identification.
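To make the dictionary-based annotation and the "overlap matching" scores above more concrete, here is a minimal Python sketch. It is not Neji's implementation (Neji is a separate framework with its own concept tree), and every name in it (Span, dictionary_match, overlap_f1, the toy lexicon) is hypothetical. It assumes that a predicted mention counts as correct when it shares at least one character with a gold mention of the same type; the article may define overlap matching more precisely.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # concept type, e.g. "CELL"


def dictionary_match(text: str, lexicon: dict[str, str]) -> list[Span]:
    """Return every case-insensitive occurrence of every lexicon name.

    This is plain substring search (no tokenization), so "cell" also
    matches inside "cells". Nested and overlapping names are all kept.
    """
    lowered = text.lower()
    spans = []
    for name, label in lexicon.items():
        needle = name.lower()
        pos = lowered.find(needle)
        while pos != -1:
            spans.append(Span(pos, pos + len(needle), label))
            pos = lowered.find(needle, pos + 1)
    return sorted(spans)


def overlaps(a: Span, b: Span) -> bool:
    """True if both spans have the same label and share a character."""
    return a.label == b.label and a.start < b.end and b.start < a.end


def overlap_f1(predicted: list[Span], gold: list[Span]) -> float:
    """F1 where a span is correct if it overlaps a same-label span."""
    hit_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    hit_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
lexicon = {"red blood cell": "CELL", "cell": "CELL", "mouse": "SPECIES"}
text = "Red blood cells carry oxygen in the mouse."
predicted = dictionary_match(text, lexicon)
gold = [Span(0, 15, "CELL"), Span(36, 41, "SPECIES")]
print(predicted)
print(overlap_f1(predicted, gold))
```

The sketch keeps both "red blood cell" and the nested "cell" rather than choosing between them; deciding which overlapping mention to retain is the disambiguation step that the abstract attributes to the concept tree.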

CONCLUSIONS

Considering the provided features and underlying characteristics, we believe that Neji is an important contribution to the biomedical community, streamlining the development of complex concept recognition solutions. Neji is freely available at http://bioinformatics.ua.pt/neji.


Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c19/3849280/76cccfa2060f/1471-2105-14-281-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c19/3849280/a8b60de4db5c/1471-2105-14-281-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c19/3849280/ff3af24ac697/1471-2105-14-281-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c19/3849280/cb164735b125/1471-2105-14-281-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c19/3849280/6c79c03475cb/1471-2105-14-281-5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c19/3849280/a4bcfbd603ce/1471-2105-14-281-6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c19/3849280/03f9431380b9/1471-2105-14-281-7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c19/3849280/08fc2b1664ea/1471-2105-14-281-8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c19/3849280/c9452ee02477/1471-2105-14-281-9.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c19/3849280/86761d1ec797/1471-2105-14-281-10.jpg
