Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization.

Affiliations

Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan.

Institute of Information Science, Academia Sinica, Taipei, Taiwan.

Publication information

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S14. doi: 10.1186/1758-2946-7-S1-S14. eCollection 2015.

DOI:10.1186/1758-2946-7-S1-S14
PMID:25810771
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4331690/
Abstract

BACKGROUND

The functions of chemical compounds and drugs that affect biological processes, and their particular effects on the onset and treatment of diseases, have attracted increasing interest with the advancement of research in the life sciences. To extract knowledge from the extensive literature on such compounds and drugs, the organizers of BioCreative IV administered the CHEMical Compound and Drug Named Entity Recognition (CHEMDNER) task to establish a standard dataset for evaluating state-of-the-art chemical entity recognition methods.

METHODS

This study introduces the approach of our CHEMDNER system. Instead of emphasizing the development of novel feature sets for machine learning, this study investigates the effect of various tag schemes on the recognition of the names of chemicals and drugs by using conditional random fields. Experiments were conducted using combinations of different tokenization strategies and tag schemes to investigate the effects of tag set selection and tokenization method on the CHEMDNER task.
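
To make the notion of a tag scheme concrete, the following is a minimal sketch of how entity spans might be encoded as per-token labels under IOB, IOBE, and IOBES. It is not the authors' implementation: the function names `tag_entity` and `encode` are hypothetical, the single-token convention for IOBE (label `B`) is an assumption, and IOB12E is left out because the abstract does not define its tags.

```python
# Hypothetical sketch: encode entity spans as token-level tags.
# IOB12E is omitted because the abstract does not define it.

SCHEMES = ("IOB", "IOBE", "IOBES")


def tag_entity(length, scheme):
    """Return the tag sequence for one entity covering `length` tokens."""
    if scheme not in SCHEMES:
        raise ValueError(f"unsupported scheme: {scheme}")
    if length < 1:
        raise ValueError("an entity must cover at least one token")
    if length == 1:
        # Only IOBES has a dedicated single-token tag; for IOB and IOBE
        # this sketch assumes a lone token is labeled B.
        return ["S"] if scheme == "IOBES" else ["B"]
    tags = ["B"] + ["I"] * (length - 1)
    if scheme in ("IOBE", "IOBES"):
        tags[-1] = "E"  # mark the last token of a multi-token entity
    return tags


def encode(n_tokens, spans, scheme="IOBES"):
    """Label `n_tokens` tokens; `spans` holds (start, end) pairs, end exclusive."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start:end] = tag_entity(end - start, scheme)
    return tags


# Five tokens with a one-token entity at index 0 and a three-token entity at 2..4.
spans = [(0, 1), (2, 5)]
for scheme in SCHEMES:
    print(scheme, encode(5, spans, scheme))
```

The richer schemes give the sequence model explicit labels for entity ends and single-token entities, which is the sense in which a tag set can be called more representative.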

RESULTS

This study reports CHEMDNER performance for three more representative tag schemes (IOBE, IOBES, and IOB12E) alongside the widely used IOB tag set, each combined with coarse- and fine-grained tokenization methods. The experimental results reveal that the fine-grained tokenization strategy performs best in terms of precision, recall, and F-score when the IOBES tag set is used. The IOBES model with fine-grained tokenization yielded the best F-scores in the six chemical entity categories other than the "Multiple" entity category. Nonetheless, no significant improvement was observed when a more representative tag scheme was used with either the coarse- or fine-grained tokenization rules. The best F-scores achieved by the developed system on the CHEMDNER test dataset were 0.833 for the chemical document indexing task and 0.815 for the chemical entity mention recognition task.
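
For readers unfamiliar with the metrics quoted above, here is a small sketch of how precision, recall, and the balanced F-score can be computed from sets of gold and predicted mentions. It assumes exact-match scoring on (start, end) character offsets, which the abstract does not spell out; the function name `precision_recall_f1` and the example offsets are hypothetical.

```python
# Hypothetical sketch: exact-match precision, recall, and F-score over mention spans.

def precision_recall_f1(gold, predicted):
    """Compare two collections of hashable mentions, e.g. (start, end) offsets."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up offsets: two of four predictions match two of three gold mentions.
gold = {(0, 5), (10, 18), (20, 25)}
predicted = {(0, 5), (10, 18), (30, 35), (40, 42)}
p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Because the F-score is the harmonic mean of precision and recall, it drops sharply when either one is low, so a system cannot reach a high score by favoring only one of the two.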

CONCLUSIONS

The results herein highlight the importance of tag set selection and the use of different tokenization strategies. Fine-grained tokenization combined with the IOBES tag set most effectively recognizes chemical and drug names. To the best of the authors' knowledge, this is the first comprehensive investigation of the use of various tag schemes combined with different tokenization strategies for the recognition of chemical entities.
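
The difference between coarse- and fine-grained tokenization can be illustrated with a short sketch. The abstract does not give the actual tokenization rules, so both functions below are assumptions: the coarse variant splits on whitespace only, and the fine variant additionally separates runs of ASCII letters, runs of digits, and each remaining symbol.

```python
# Hypothetical sketch: the paper's actual tokenization rules are not given
# in the abstract, so these two tokenizers are illustrative assumptions.
import re


def coarse_tokenize(text):
    """Split on whitespace only."""
    return text.split()


def fine_tokenize(text):
    """Split into runs of ASCII letters, runs of digits, and single other symbols."""
    return re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text)


name = "2-acetylaminofluorene (AAF)"
print(coarse_tokenize(name))  # ['2-acetylaminofluorene', '(AAF)']
print(fine_tokenize(name))    # ['2', '-', 'acetylaminofluorene', '(', 'AAF', ')']
```

With finer tokens, entity boundaries that fall inside a whitespace-delimited string, such as the parentheses around an abbreviation, line up with token boundaries, so a token-level tagger can label the mention exactly.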


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a16e/4331690/bde22272dfeb/1758-2946-7-S1-S14-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a16e/4331690/bde34b5bd436/1758-2946-7-S1-S14-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a16e/4331690/d374c50ab8ba/1758-2946-7-S1-S14-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a16e/4331690/987b12fc624d/1758-2946-7-S1-S14-4.jpg

