Chemical entity extraction using CRF and an ensemble of extractors.

Affiliations

Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA.

Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA ; Information Sciences and Technology, The Pennsylvania State University, University Park, PA, USA.

Publication information

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S12. doi: 10.1186/1758-2946-7-S1-S12. eCollection 2015.

DOI:10.1186/1758-2946-7-S1-S12
PMID:25810769
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4331688/
Abstract

BACKGROUND

Interest in identifying and extracting chemical entities from academic articles is growing, and many approaches have been proposed to solve this problem. In this work we describe a probabilistic framework that allows the output of multiple information extraction systems to be combined in a systematic way. The identified entities are assigned a probability score that reflects the extractors' confidence, without the need for each individual extractor to generate a probability score. We quantitatively compared the performance of multiple chemical tokenizers to measure the effect of tokenization on extraction accuracy. We then built a single Conditional Random Fields (CRF) extractor that uses the best-performing tokenizer and a unique collection of features, such as word embeddings and Soundex codes, which, to the best of our knowledge, have not been explored in this context before.
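To make the Soundex feature concrete, here is a minimal sketch of the standard American Soundex code and of how it might be attached to a token as a CRF feature. The abstract does not say which Soundex variant or feature layout the authors used, so the function names `soundex` and `token_features` and the feature keys are assumptions made for illustration only.

```python
# Standard American Soundex letter groups; vowels, y, h and w carry no digit.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of `word` ('' if it has no ASCII letters)."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        # h and w do not separate letters that share a code; vowels do.
        if ch not in "hw":
            prev = code
    return "".join(out)[:4].ljust(4, "0")


def token_features(token: str) -> dict:
    """Hypothetical per-token feature dict of the kind a CRF toolkit could consume."""
    return {"lower": token.lower(), "soundex": soundex(token)}


print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"))
```

Because similar-sounding spellings share a code, a feature like this can group spelling variants of a name under one value; whether that is the effect the authors relied on is not stated in the abstract.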

RESULTS

The ensemble of multiple extractors outperformed each individual extractor during the CHEMDNER challenge. When the runs were optimized to favor recall, the ensemble approach achieved the second highest recall on unseen entities. The single CRF model with novel features achieves an F1 score of 83.3% on the test set, without any post-processing or abbreviation matching.
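For readers unfamiliar with the metric, the F1 score reported above is the harmonic mean of precision and recall. The sketch below computes all three at the entity level, assuming that a predicted mention counts as correct only when its (start, end) character span exactly matches a gold span. That matching rule, the function name, and the toy spans are assumptions for illustration; the abstract does not describe the official evaluation procedure.

```python
def precision_recall_f1(predicted, gold):
    """Entity-level precision, recall and F1 over exact-match spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 2 of 4 predictions are correct, and 2 of 3 gold spans are found.
gold_spans = {(0, 7), (12, 20), (30, 35)}
predicted_spans = {(0, 7), (12, 20), (40, 45), (50, 52)}
p, r, f1 = precision_recall_f1(predicted_spans, gold_spans)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```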

CONCLUSIONS

Ensemble information extraction is effective when multiple standalone extractors are to be used, and produces higher performance than individual off-the-shelf extractors. The novel features introduced in the single CRF model are sufficient to achieve a very competitive F1 score using a simple standalone extractor.
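One way to give an ensemble's output a probability without asking each extractor for a score is to look at which extractors agreed on a mention and estimate, from held-out annotated data, how often that agreement pattern was correct. The sketch below illustrates this idea only; it is not taken from the paper, and the function names, the extractor labels "A" and "B", and the add-one smoothing are all assumptions.

```python
from collections import defaultdict


def fit_pattern_probabilities(dev_records, smoothing=1.0):
    """Estimate P(correct | set of extractors that emitted the mention).

    dev_records: iterable of (pattern, is_correct) pairs, where pattern is an
    iterable of extractor names and is_correct is a bool from held-out data.
    """
    counts = defaultdict(lambda: [0, 0])  # pattern -> [correct, total]
    for pattern, is_correct in dev_records:
        key = frozenset(pattern)
        counts[key][0] += int(is_correct)
        counts[key][1] += 1
    # Smoothed estimate keeps rare patterns away from exactly 0 or 1.
    return {key: (correct + smoothing) / (total + 2 * smoothing)
            for key, (correct, total) in counts.items()}


def score_mention(pattern, table, default=0.5):
    """Look up the probability for a vote pattern; unseen patterns get `default`."""
    return table.get(frozenset(pattern), default)


# Hypothetical held-out data: mentions found by both A and B were always right,
# mentions found by A alone were right once in four cases.
dev = [({"A", "B"}, True)] * 3 + [({"A"}, True)] + [({"A"}, False)] * 3
table = fit_pattern_probabilities(dev)
print(score_mention({"A", "B"}, table), score_mention({"A"}, table))
```

Thresholding such a score would let a run trade precision for recall, which is consistent with the recall-optimized runs mentioned in the results, although the abstract does not say how those runs were tuned.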

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/498a/4331688/fc3f94d4b06a/1758-2946-7-S1-S12-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/498a/4331688/7d3e63bd5b13/1758-2946-7-S1-S12-1.jpg

Similar articles

1
Chemical entity extraction using CRF and an ensemble of extractors.
J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S12. doi: 10.1186/1758-2946-7-S1-S12. eCollection 2015.
2
LSTMVoter: chemical named entity recognition using a conglomerate of sequence labeling tools.
J Cheminform. 2019 Jan 10;11(1):3. doi: 10.1186/s13321-018-0327-2.
3
CRFVoter: gene and protein related object recognition using a conglomerate of CRF-based tools.
J Cheminform. 2019 Mar 14;11(1):21. doi: 10.1186/s13321-019-0343-x.
4
A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature.
J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S8. doi: 10.1186/1758-2946-7-S1-S8. eCollection 2015.
5
Clinical Named Entity Recognition From Chinese Electronic Health Records via Machine Learning Methods.
JMIR Med Inform. 2018 Dec 17;6(4):e50. doi: 10.2196/medinform.9965.
6
Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning.
Database (Oxford). 2016 Apr 17;2016. doi: 10.1093/database/baw049. Print 2016.
7
A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature.
J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S11. doi: 10.1186/1758-2946-7-S1-S11. eCollection 2015.
8
CHEMDNER system with mixed conditional random fields and multi-scale word clustering.
J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S4. doi: 10.1186/1758-2946-7-S1-S4. eCollection 2015.
9
A deep learning model incorporating part of speech and self-matching attention for named entity recognition of Chinese electronic medical records.
BMC Med Inform Decis Mak. 2019 Apr 9;19(Suppl 2):65. doi: 10.1186/s12911-019-0762-7.
10
Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization.
J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S14. doi: 10.1186/1758-2946-7-S1-S14. eCollection 2015.

Cited by

1
Auto-generated database of semiconductor band gaps using ChemDataExtractor.
Sci Data. 2022 May 3;9(1):193. doi: 10.1038/s41597-022-01294-6.
2
Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies.
Front Genet. 2020 Dec 22;11:618862. doi: 10.3389/fgene.2020.618862. eCollection 2020.
3
LSTMVoter: chemical named entity recognition using a conglomerate of sequence labeling tools.
J Cheminform. 2019 Jan 10;11(1):3. doi: 10.1186/s13321-018-0327-2.
4
Putting hands to rest: efficient deep CNN-RNN architecture for chemical named entity recognition with no hand-crafted rules.
J Cheminform. 2018 May 23;10(1):28. doi: 10.1186/s13321-018-0280-0.
5
Recognizing chemicals in patents: a comparative analysis.
J Cheminform. 2016 Oct 28;8:59. doi: 10.1186/s13321-016-0172-0. eCollection 2016.
6
Recognition of chemical entities: combining dictionary-based and grammar-based approaches.
J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S10. doi: 10.1186/1758-2946-7-S1-S10. eCollection 2015.
7
CHEMDNER: The drugs and chemical names extraction challenge.
J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S1. doi: 10.1186/1758-2946-7-S1-S1. eCollection 2015.

References

1
CHEMDNER: The drugs and chemical names extraction challenge.
J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S1. doi: 10.1186/1758-2946-7-S1-S1. eCollection 2015.
2
ChemSpot: a hybrid system for chemical named entity recognition.
Bioinformatics. 2012 Jun 15;28(12):1633-40. doi: 10.1093/bioinformatics/bts183. Epub 2012 Apr 12.
3
OSCAR4: a flexible architecture for chemical text-mining.
J Cheminform. 2011 Oct 14;3(1):41. doi: 10.1186/1758-2946-3-41.
4
A dictionary to identify small molecules and drugs in free text.
Bioinformatics. 2009 Nov 15;25(22):2983-91. doi: 10.1093/bioinformatics/btp535. Epub 2009 Sep 16.
5
Abbreviation definition identification based on automatic precision estimates.
BMC Bioinformatics. 2008 Sep 25;9:402. doi: 10.1186/1471-2105-9-402.
6
BANNER: an executable survey of advances in biomedical named entity recognition.
Pac Symp Biocomput. 2008:652-63.