
Harmonization of gene/protein annotations: towards a gold standard MEDLINE.

Affiliation

University of Aveiro, IEETA/DETI, Campus Universitário de Santiago, Aveiro, Portugal.

Publication information

Bioinformatics. 2012 May 1;28(9):1253-61. doi: 10.1093/bioinformatics/bts125. Epub 2012 Mar 13.

DOI: 10.1093/bioinformatics/bts125
PMID: 22419783

Abstract

MOTIVATION

Named entity recognition (NER) is an elementary task in biomedical text mining. A number of NER solutions have been proposed in recent years, taking advantage of available annotated corpora, terminological resources and machine-learning techniques. Currently, the best-performing solutions combine the outputs of selected annotation solutions and are measured against a single corpus. However, little effort has been spent on systematically analysing methods that harmonize the annotation results and are measured against a combination of Gold Standard Corpora (GSCs).

RESULTS

We present Totum, a machine learning solution that harmonizes gene/protein annotations provided by heterogeneous NER solutions. It has been optimized and measured against a combination of manually curated GSCs. The performed experiments show that our approach improves the F-measure of state-of-the-art solutions by up to 10% (achieving ≈70%) in exact alignment and 22% (achieving ≈82%) in nested alignment. We demonstrate that our solution delivers reliable annotation results across the GSCs and it is an important contribution towards a homogeneous annotation of MEDLINE abstracts.
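The difference between the two alignment modes reported above can be illustrated with a small sketch. The abstract does not define them, so this assumes that exact alignment requires identical span boundaries and that nested alignment accepts a prediction when one span fully contains the other. The function names, the span representation and the sample offsets are hypothetical and are not taken from Totum.

```python
from typing import Callable, List, Tuple

# A span is a (start, end) pair of character offsets, end exclusive (assumed representation).
Span = Tuple[int, int]


def exact_match(pred: Span, gold: Span) -> bool:
    """Exact alignment (assumed definition): both boundaries must coincide."""
    return pred == gold


def nested_match(pred: Span, gold: Span) -> bool:
    """Nested alignment (assumed definition): one span fully contains the other."""
    pred_in_gold = gold[0] <= pred[0] and pred[1] <= gold[1]
    gold_in_pred = pred[0] <= gold[0] and gold[1] <= pred[1]
    return pred_in_gold or gold_in_pred


def f_measure(predicted: List[Span], gold: List[Span],
              match: Callable[[Span, Span], bool]) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    # Precision counts predictions that align with some gold annotation.
    hits_pred = sum(1 for p in predicted if any(match(p, g) for g in gold))
    # Recall counts gold annotations that are aligned with some prediction.
    hits_gold = sum(1 for g in gold if any(match(p, g) for p in predicted))
    precision = hits_pred / len(predicted) if predicted else 0.0
    recall = hits_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up offsets for illustration only.
gold = [(0, 5), (10, 20), (30, 38)]
predicted = [(0, 5), (12, 20), (40, 45)]

exact_f = f_measure(predicted, gold, exact_match)
nested_f = f_measure(predicted, gold, nested_match)
print(f"exact F = {exact_f:.3f}, nested F = {nested_f:.3f}")
```

In this toy data the prediction (12, 20) lies inside the gold span (10, 20), so it counts only under the nested criterion. That is why nested scores are never lower than exact ones for the same output.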

AVAILABILITY AND IMPLEMENTATION

Totum is implemented in Java and its resources are available at http://bioinformatics.ua.pt/totum


Similar articles

1. Harmonization of gene/protein annotations: towards a gold standard MEDLINE.
   Bioinformatics. 2012 May 1;28(9):1253-61. doi: 10.1093/bioinformatics/bts125. Epub 2012 Mar 13.
2. NCBI disease corpus: a resource for disease name recognition and concept normalization.
   J Biomed Inform. 2014 Feb;47:1-10. doi: 10.1016/j.jbi.2013.12.006. Epub 2014 Jan 3.
3. Assessment of NER solutions against the first and second CALBC Silver Standard Corpus.
   J Biomed Semantics. 2011 Oct 6;2 Suppl 5(Suppl 5):S11. doi: 10.1186/2041-1480-2-S5-S11.
4. Boosting drug named entity recognition using an aggregate classifier.
   Artif Intell Med. 2015 Oct;65(2):145-53. doi: 10.1016/j.artmed.2015.05.007. Epub 2015 Jun 17.
5. TaggerOne: joint named entity recognition and normalization with semi-Markov Models.
   Bioinformatics. 2016 Sep 15;32(18):2839-46. doi: 10.1093/bioinformatics/btw343. Epub 2016 Jun 9.
6. GENETAG: a tagged corpus for gene/protein named entity recognition.
   BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S3. doi: 10.1186/1471-2105-6-S1-S3. Epub 2005 May 24.
7. Transfer learning for biomedical named entity recognition with neural networks.
   Bioinformatics. 2018 Dec 1;34(23):4087-4094. doi: 10.1093/bioinformatics/bty449.
8. A framework for semisupervised feature generation and its applications in biomedical literature mining.
   IEEE/ACM Trans Comput Biol Bioinform. 2011 Mar-Apr;8(2):294-307. doi: 10.1109/TCBB.2010.99.
9. Investigating heterogeneous protein annotations toward cross-corpora utilization.
   BMC Bioinformatics. 2009 Dec 9;10:403. doi: 10.1186/1471-2105-10-403.
10. Recognizing names in biomedical texts: a machine learning approach.
    Bioinformatics. 2004 May 1;20(7):1178-90. doi: 10.1093/bioinformatics/bth060. Epub 2004 Feb 10.

Cited by

1. AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature.
   Sci Transl Med. 2020 May 20;12(544). doi: 10.1126/scitranslmed.aau9113.
2. Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems.
   Database (Oxford). 2018 Jan 1;2018:bay110. doi: 10.1093/database/bay110.
3. A gene-phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach.
   Bioinformatics. 2018 Jul 1;34(13):i386-i394. doi: 10.1093/bioinformatics/bty263.
4. A document processing pipeline for annotating chemical entities in scientific documents.
   J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S7. doi: 10.1186/1758-2946-7-S1-S7. eCollection 2015.
5. Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources.
   J Biomed Semantics. 2013 Oct 11;4(1):28. doi: 10.1186/2041-1480-4-28.
6. A modular framework for biomedical concept recognition.
   BMC Bioinformatics. 2013 Sep 24;14:281. doi: 10.1186/1471-2105-14-281.
7. Gimli: open source and high-performance biomedical name recognition.
   BMC Bioinformatics. 2013 Feb 15;14:54. doi: 10.1186/1471-2105-14-54.