CHEMDNER system with mixed conditional random fields and multi-scale word clustering.

Affiliations

School of Computer, Wuhan University, Wuhan 430072, China.

School of Public Health, Wuhan University, Wuhan 430072, China.

Publication information

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S4. doi: 10.1186/1758-2946-7-S1-S4. eCollection 2015.

Abstract

BACKGROUND

Chemical compound and drug name recognition plays an important role in chemical text mining and is the basis for automatic relation extraction and event identification in chemical information processing. A high-performance named entity recognition system for chemical compound and drug names is therefore necessary.

METHODS

We developed a CHEMDNER system based on mixed conditional random fields (CRF) with word clustering for chemical compound and drug name recognition. For word clustering, we applied Brown's hierarchical algorithm and the deep-learning-based Skip-gram model to a large collection of PubMed articles, including titles and abstracts.
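One way word clusters can feed a CRF tagger is as per-token features at several granularities. The sketch below is illustrative only and is not the authors' implementation: the cluster tables, the prefix lengths, the cluster counts, and the idea of deriving flat cluster ids from Skip-gram vectors are all assumptions made for the example. It builds the feature dictionaries that a CRF toolkit would typically consume.

```python
from typing import Dict, List

# Hypothetical Brown cluster table: token -> bit-string path in the cluster tree.
# Real tables would be induced from a large corpus such as PubMed abstracts.
BROWN_PATHS: Dict[str, str] = {
    "aspirin": "110100",
    "ibuprofen": "110101",
    "inhibits": "0111",
}

# Hypothetical flat cluster ids derived from Skip-gram vectors (for example by
# k-means), computed at two granularities. The keys are numbers of clusters.
EMBEDDING_CLUSTERS: Dict[int, Dict[str, int]] = {
    100: {"aspirin": 17, "ibuprofen": 17, "inhibits": 63},
    1000: {"aspirin": 412, "ibuprofen": 598, "inhibits": 871},
}

# Assumed prefix lengths: shorter prefixes give coarser Brown clusters.
PREFIX_LENGTHS = (2, 4, 6)


def token_features(tokens: List[str], i: int) -> Dict[str, str]:
    """Return cluster-based features for the token at position i."""
    word = tokens[i].lower()
    feats = {"word": word}

    # Brown clusters: one feature per prefix length (coarse to fine).
    path = BROWN_PATHS.get(word)
    if path is not None:
        for n in PREFIX_LENGTHS:
            feats[f"brown_{n}"] = path[:n]

    # Embedding clusters: one feature per clustering granularity.
    for k, table in EMBEDDING_CLUSTERS.items():
        if word in table:
            feats[f"emb_{k}"] = str(table[word])

    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, str]]:
    """Feature dictionaries for every token in a sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

In this toy table, "aspirin" and "ibuprofen" share the coarse features (`brown_4`, `emb_100`) but differ at the finest scale, which is how cluster features could let a tagger generalize from one drug name to another without hand-written rules.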

RESULTS

In BioCreative IV, the system achieved the highest F-score, 88.20%, on the chemical document indexing (CDI) task and the second-highest F-score, 87.11%, on the chemical entity mention recognition (CEM) task. Multi-scale clustering based on deep learning further improved performance, to an F-score of 88.71% for CDI and 88.06% for CEM.
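For readers unfamiliar with the metric, the F-score reported here is assumed to be the balanced F-score (F1), the harmonic mean of precision and recall. The sketch below computes it from raw counts. The function name and the example counts are illustrative and are not taken from the paper.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-score (F1) from counts of predicted and gold entities."""
    predicted = true_pos + false_pos
    gold = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / gold if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

For example, 8 correct predictions with 2 spurious and 2 missed entities give a precision and recall of 0.8 each, so the F-score is also 0.8.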

CONCLUSIONS

The mixed CRF model represents both the internal complexity and the external contexts of the entities, and it is integrated with word clustering to capture domain knowledge from PubMed articles, including titles and abstracts. This domain knowledge helps maintain entity recognition performance even without fine-grained linguistic features or manually designed rules.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5a25/4331694/b59a904ba38d/1758-2946-7-S1-S4-1.jpg