A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature.

Author information

Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China; School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA.

Department of Pharmacy, the First Affiliated Hospital, Harbin Medical University, Harbin, Heilongjiang, China.

Publication information

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S8. doi: 10.1186/1758-2946-7-S1-S8. eCollection 2015.

DOI:10.1186/1758-2946-7-S1-S8

PMID:25810779

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4331698/

Abstract

BACKGROUND

Chemical compounds and drugs (together called chemical entities) embedded in scientific articles are crucial for many information extraction tasks in the biomedical domain. However, only a very limited number of chemical entity recognition systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of chemical entity recognition systems, the Spanish National Cancer Research Center (CNIO) and the University of Navarra organized a challenge on Chemical and Drug Named Entity Recognition (CHEMDNER). The CHEMDNER challenge contains two individual subtasks: 1) Chemical Entity Mention recognition (CEM); and 2) Chemical Document Indexing (CDI). Our study proposes machine learning-based systems for the CEM task.

METHODS

The 2013 CHEMDNER challenge organizers provided 10,000 UTF-8-encoded PubMed abstracts, manually annotated according to a predefined annotation guideline: a training set of 3,500 abstracts, a development set of 3,500 abstracts, and a test set of 3,000 abstracts. We developed two machine learning-based systems for the CEM task on this data set, one based on conditional random fields (CRF) and one based on structured support vector machines (SSVM). We also investigated the effects of three types of word representation (WR) features, generated by Brown clustering, random indexing, and skip-gram, on both systems. The performance of our systems was evaluated on the test set using scripts provided by the CHEMDNER challenge organizers. The primary evaluation measures were micro-averaged precision, recall, and F-measure.
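As an illustration of the micro-averaged measures named above, the following is a minimal sketch, not the official CHEMDNER evaluation script. It assumes that a predicted mention counts as correct only when its character offsets exactly match a gold mention in the same abstract; the function name `micro_prf`, the document IDs, and the toy spans are all hypothetical.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int]  # (start_offset, end_offset) of one chemical mention


def micro_prf(
    gold: Dict[str, Set[Span]], pred: Dict[str, Set[Span]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-measure over all abstracts.

    Counts are pooled across documents before the ratios are taken,
    so abstracts with many mentions weigh more than sparse ones.
    Assumption: exact span match is required for a true positive.
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(pred):
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # mentions found with exact offsets
        fp += len(p - g)   # predicted mentions not in the gold standard
        fn += len(g - p)   # gold mentions the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy data (hypothetical): two abstracts, three gold mentions.
gold = {"a1": {(0, 7), (15, 22)}, "a2": {(3, 9)}}
pred = {"a1": {(0, 7), (30, 34)}, "a2": {(3, 9)}}

p, r, f = micro_prf(gold, pred)
print(f"P={p:.4f} R={r:.4f} F={f:.4f}")
```

In the toy data, two of the three predictions match gold mentions and two of the three gold mentions are recovered, so precision, recall, and F-measure all come out at 2/3.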

RESULTS

Our best system was among the top-ranked systems, with an official micro F-measure of 85.05%. Fixing a bug caused by inconsistent features marginally improved the performance of the system, to a micro F-measure of 85.20%.

CONCLUSIONS

The SSVM-based CEM systems outperformed the CRF-based CEM systems when using the same features. Each type of WR feature was beneficial to the CEM task. Both the CRF-based and SSVM-based systems using all three types of WR features showed better performance than the systems using only one type of WR feature.
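For readers comparing the two learners, the textbook training objectives are sketched below. The abstract does not state the exact formulations, regularization, or loss function used, so the forms here are assumptions: an L2-regularized linear-chain CRF and a margin-rescaled SSVM. The symbols are illustrative: $\Phi(x, y)$ is a joint feature map over a token sequence $x$ and label sequence $y$, $\lambda$ is a regularization weight, and $\Delta(y_i, y)$ is a task loss such as the number of mislabeled tokens.

```latex
% CRF: regularized negative conditional log-likelihood
\min_{w}\; \frac{\lambda}{2}\lVert w \rVert^{2}
  - \sum_{i=1}^{n} \log p_{w}(y_i \mid x_i),
\qquad
p_{w}(y \mid x) =
  \frac{\exp\!\big(w^{\top}\Phi(x, y)\big)}
       {\sum_{y'} \exp\!\big(w^{\top}\Phi(x, y')\big)}

% SSVM (margin rescaling): regularized structured hinge loss
\min_{w}\; \frac{\lambda}{2}\lVert w \rVert^{2}
  + \sum_{i=1}^{n} \Big[ \max_{y} \big( \Delta(y_i, y)
  + w^{\top}\Phi(x_i, y) \big) - w^{\top}\Phi(x_i, y_i) \Big]
```

Both objectives score label sequences with the same linear function $w^{\top}\Phi(x, y)$, which is why the two systems can share one feature set. They differ in the loss: the CRF normalizes over all label sequences, while the SSVM penalizes only the most violating sequence.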
