Extending TextAE for annotation of non-contiguous entities.

Authors

Lever Jake, Altman Russ, Kim Jin-Dong

Affiliations

Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA.

Database Center for Life Science, Research Organization of Information and Systems, Kashiwa 277-0871, Japan.

Publication information

Genomics Inform. 2020 Jun;18(2):e15. doi: 10.5808/GI.2020.18.2.e15. Epub 2020 Jun 15.

DOI:10.5808/GI.2020.18.2.e15
PMID:32634869
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7362949/
Abstract

Named entity recognition tools are used to identify mentions of biomedical entities in free text and are essential components of high-quality information retrieval and extraction systems. Without good entity recognition, methods will mislabel searched text and will miss important information or identify spurious text that will frustrate users. Most tools do not capture non-contiguous entities, which are separate spans of text that together refer to an entity, e.g., the entity "type 1 diabetes" in the phrase "type 1 and type 2 diabetes." This type is commonly found in biomedical texts, especially in lists, where multiple biomedical entities are named in shortened form to avoid repeating words. Most text annotation systems that enable users to view and edit entity annotations do not support non-contiguous entities. Therefore, experts cannot even visualize non-contiguous entities, let alone annotate them to build valuable datasets for machine learning methods. To combat this problem, and as part of the BLAH6 hackathon, we extended the TextAE platform to allow visualization and annotation of non-contiguous entities. This enables users to add new subspans to existing entities by selecting additional text. We integrate this new functionality with TextAE's existing editing functionality to allow easy changes to entity annotations and editing of relation annotations involving non-contiguous entities, with importing from and exporting to the PubAnnotation format. Finally, we roughly quantify the problem across the entire accessible biomedical literature to highlight that a substantial number of non-contiguous entities appear in lists and would be missed by most text mining systems.
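The idea of a non-contiguous entity can be illustrated with a small sketch using the abstract's own example phrase. The `Entity` class, its `spans` field, and the `find_span` helper below are hypothetical names chosen for illustration; they are not TextAE's internal model and not the PubAnnotation JSON schema, which the abstract does not describe in detail.

```python
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[int, int]  # half-open (begin, end) character offsets


@dataclass
class Entity:
    """Hypothetical entity record: one label, one or more text spans."""
    label: str
    spans: List[Span]

    def surface(self, text: str) -> str:
        # Join the subspans in document order to recover the full mention.
        return " ".join(text[b:e] for b, e in sorted(self.spans))

    def is_contiguous(self) -> bool:
        # A single span means the mention is one unbroken stretch of text.
        return len(self.spans) == 1


def find_span(text: str, phrase: str) -> Span:
    """Return the offsets of the first occurrence of phrase in text."""
    begin = text.index(phrase)
    return (begin, begin + len(phrase))


text = "type 1 and type 2 diabetes"

# "type 2 diabetes" is an ordinary contiguous mention.
t2d = Entity("Disease", [find_span(text, "type 2 diabetes")])

# "type 1 diabetes" needs two subspans: "type 1" and the shared "diabetes".
t1d = Entity("Disease", [find_span(text, "type 1"), find_span(text, "diabetes")])

print(t1d.surface(text), t1d.spans, t1d.is_contiguous())
print(t2d.surface(text), t2d.spans, t2d.is_contiguous())
```

A tool that stores only one span per entity can represent `t2d` but has no way to represent `t1d`, which is the gap the abstract describes. Adding a subspan to an existing entity corresponds here to appending another `(begin, end)` pair to `spans`.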

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7b04/7362949/34109afb60c9/gi-2020-18-2-e15f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7b04/7362949/0ef3eaec0be7/gi-2020-18-2-e15f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7b04/7362949/528029facba6/gi-2020-18-2-e15f2.jpg
