Improving chemical disease relation extraction with rich features and weakly labeled data.

Authors

Peng Yifan, Wei Chih-Hsuan, Lu Zhiyong

Affiliations

National Center for Biotechnology Information, Bethesda, MD 20894 USA; Computer and Information Sciences, University of Delaware, Newark, DE 19716 USA.

National Center for Biotechnology Information, Bethesda, MD 20894 USA.

Publication details

J Cheminform. 2016 Oct 7;8:53. doi: 10.1186/s13321-016-0165-z. eCollection 2016.

DOI:10.1186/s13321-016-0165-z

PMID:28316651

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5054544/

Abstract

BACKGROUND

Identifying relations between chemicals and diseases is important for new drug discovery and for improving chemical safety, so there is growing interest in automatic relation extraction systems that capture these relations from the rich and rapidly growing biomedical literature. In this work we aim to build on current advances in named entity recognition and a recent BioCreative effort to further improve the state of the art in biomedical relation extraction, in particular for chemical-induced disease (CID) relations.

RESULTS

We propose a rich-feature approach with a Support Vector Machine to aid in the extraction of CIDs from PubMed articles. Our feature vector includes novel statistical features, linguistic knowledge, and domain resources. We also incorporate the output of a rule-based system as features, thus combining the advantages of rule-based and machine learning-based systems. Furthermore, we augment our approach with automatically generated labeled text from an existing knowledge base to improve performance without additional cost for corpus construction. To evaluate our system, we perform experiments on the human-annotated BioCreative V benchmarking dataset and compare with previous results. When trained using only the BioCreative V training and development sets, our system achieves an F-score of 57.51%, which already compares favorably to previous methods. The F-score improves further to 61.01% when the training data are augmented with additional automatically generated weakly labeled data.
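The weak-labeling idea and the F-score used for evaluation can be illustrated with a minimal sketch. It assumes that each document comes with sets of recognized chemical and disease identifiers and that the knowledge base lists curated chemical-disease pairs per document. The function names, data layout, and identifiers (`C1`, `D1`, `doc1`) are hypothetical and are not taken from the paper; the actual system's feature extraction and SVM training are not shown.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]           # (chemical_id, disease_id), hypothetical IDs
Instance = Tuple[str, str, str]  # (doc_id, chemical_id, disease_id)


def weak_label(
    doc_entities: Dict[str, Tuple[Set[str], Set[str]]],
    curated_pairs: Dict[str, Set[Pair]],
) -> Dict[Instance, bool]:
    """Label every chemical-disease pair that co-occurs in a document.

    A pair is positive if the knowledge base curates it for that document,
    negative otherwise. No human annotates the text itself, which is why
    the resulting labels are only "weak".
    """
    labels: Dict[Instance, bool] = {}
    for doc_id, (chemicals, diseases) in doc_entities.items():
        known = curated_pairs.get(doc_id, set())
        for chem in sorted(chemicals):
            for dis in sorted(diseases):
                labels[(doc_id, chem, dis)] = (chem, dis) in known
    return labels


def f_score(predicted: Set[Instance], gold: Set[Instance]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score); F is their harmonic mean."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


if __name__ == "__main__":
    # Toy input: one document mentioning two chemicals and one disease.
    docs = {"doc1": ({"C1", "C2"}, {"D1"})}
    # Toy knowledge base: only C1 -> D1 is curated for this document.
    kb = {"doc1": {("C1", "D1")}}
    for instance, is_positive in weak_label(docs, kb).items():
        print(instance, is_positive)
```

Instances labeled this way could be added to the human-annotated training set before the classifier is trained; how noisy negatives are filtered is a design choice this sketch leaves open.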

CONCLUSIONS

Our text-mining approach demonstrates state-of-the-art performance in disease-chemical relation extraction. More importantly, this work exemplifies the use of (freely available) curated document-level annotations in existing biomedical databases, which are largely overlooked in text-mining system development.
