A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature.

Author information

Information Technology Supporting Center, Institute of Scientific and Technical Information of China, No. 15 Fuxing Rd., Haidian District, 100038 Beijing, PR China.

School of Economics and Management, Beijing Forestry University, No. 35 Qinghua East Rd., Haidian District, 100083 Beijing, PR China.

Publication information

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S11. doi: 10.1186/1758-2946-7-S1-S11. eCollection 2015.

DOI:10.1186/1758-2946-7-S1-S11

PMID:25810768

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4331687/

Abstract

BACKGROUND

To improve access to information on chemical compounds and drugs (chemical entities) described in text repositories, it is crucial to identify chemical entity mentions (CEMs) in text automatically. The CHEMDNER challenge in BioCreative IV was designed to promote the development of systems that detect mentions of chemical compounds and drugs. It comprises two subtasks: CDI (Chemical Document Indexing) and CEM.

RESULTS

Our processing pipeline consists of three major components: pre-processing (sentence detection, tokenization), recognition (a CRF-based approach), and post-processing (a rule-based approach and format conversion). In our post-challenge system, the cost parameter of the CRF model was optimized by grid search with 10-fold cross-validation, and a word-representation feature induced by Brown clustering was introduced. For the CEM subtask, our official runs were ranked in the top position, with maximum scores of 88.79% precision, 69.08% recall, and 77.70% balanced F-measure. The post-challenge system reached 88.43% precision, 76.48% recall, and 82.02% balanced F-measure, improving recall and F-measure at a slight cost in precision.
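The balanced F-measure is the harmonic mean of precision and recall, so the reported figures can be checked against each other. The following is a minimal sketch of that check; the function name `balanced_f_measure` is illustrative and not taken from the paper, and only the percentages come from the abstract.

```python
def balanced_f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (percentages).
official = balanced_f_measure(88.79, 69.08)
post_challenge = balanced_f_measure(88.43, 76.48)

print(f"official runs:  {official:.2f}")        # 77.70
print(f"post-challenge: {post_challenge:.2f}")  # 82.02
```

Both values agree with the balanced F-measures stated in the abstract, which suggests the reported precision and recall come from the same run in each case.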

CONCLUSIONS

Instead of extracting a CEM as a whole, our system treats CEM recognition as a sequence labeling problem. Although the current system leaves much room for improvement, it shows that balanced F-measure can be improved substantially by exploiting large amounts of relatively inexpensive unannotated PubMed abstracts and by optimizing the cost parameter of the CRF model. In our experience, directly applying open-source natural language processing (NLP) toolkits such as OpenNLP or Stanford CoreNLP can yield a very high false positive (FP) rate. If one does not want to retrain the underlying models, it is better to develop additional rules to reduce the FP rate. Our CEM recognition system is available at: http://www.SciTeMiner.org/XuShuo/Demo/CEM.
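To illustrate what casting CEM recognition as sequence labeling means, the sketch below converts a character-offset mention annotation into one label per token. It assumes a B/I/O label scheme and whitespace tokenization purely for illustration; the abstract does not state which label set or tokenizer the system uses, and the helper names `tokenize_with_offsets` and `bio_labels` are hypothetical.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def tokenize_with_offsets(text: str) -> List[Tuple[str, Span]]:
    """Whitespace tokenization that keeps character offsets (an assumption)."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


def bio_labels(token_spans: List[Span], mentions: List[Span]) -> List[str]:
    """Label each token B (begins a mention), I (inside one), or O (outside)."""
    labels = []
    for tok_start, tok_end in token_spans:
        label = "O"
        for m_start, m_end in mentions:
            if tok_start >= m_start and tok_end <= m_end:
                label = "B" if tok_start == m_start else "I"
                break
        labels.append(label)
    return labels


text = "Treatment with acetylsalicylic acid reduced inflammation ."
mention = "acetylsalicylic acid"
start = text.index(mention)
mentions = [(start, start + len(mention))]

tokens = tokenize_with_offsets(text)
labels = bio_labels([span for _, span in tokens], mentions)

for (token, _), label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A sequence labeler such as a CRF is then trained to predict these per-token labels, and contiguous B/I runs are merged back into mention spans afterwards.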
