Sieve-based coreference resolution enhances semi-supervised learning model for chemical-induced disease relation extraction.

Author information

Le Hoang-Quynh, Tran Mai-Vu, Dang Thanh Hai, Ha Quang-Thuy, Collier Nigel

Affiliations

Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam. Building E3, 144 Xuan Thuy str., Cau Giay dist., Hanoi, Vietnam. Postal code: 100000.

Publication information

Database (Oxford). 2016 Jul;2016. doi: 10.1093/database/baw102.

Abstract

The BioCreative V chemical-disease relation (CDR) track was proposed to accelerate progress in text mining for an integrative understanding of chemicals, diseases and their relations. In this article, we describe an extension of our system, UET-CAM, which participated in the BioCreative V CDR track. The BioCreative CDR track committee ranked the original UET-CAM system fourth among 18 participating systems. In the Disease Named Entity Recognition and Normalization (DNER) phase, our system employed joint inference (decoding) with a perceptron-based named entity recognizer (NER) and a back-off model with Semantic Supervised Indexing and Skip-gram for named entity normalization. In the chemical-induced disease (CID) relation extraction phase, we proposed a pipeline that includes a coreference resolution module and a Support Vector Machine relation extraction model. The coreference resolution module used a multi-pass sieve to extend entity recall. In this article, we improve the UET-CAM system by adding a 'silver' CID corpus to train the prediction model. This silver-standard corpus of more than 50 thousand sentences was built automatically from the Comparative Toxicogenomics Database (CTD). We evaluated our method on the CDR test set. Results showed that our system reached state-of-the-art performance, with an F1 of 82.44 for the DNER task and 58.90 for the CID task. Analysis demonstrated substantial benefits from both the multi-pass sieve coreference resolution method (F1 +4.13%) and the silver CID corpus (F1 +7.3%).

Database URL: SilverCID, the silver-standard corpus for CID relation extraction, is freely available online at https://zenodo.org/record/34530 (doi:10.5281/zenodo.34530).
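The abstract does not spell out how the silver corpus was derived from CTD. One common way to build such a corpus is distant supervision: a chemical-disease pair that co-occurs in a sentence is labeled positive if a knowledge base lists it as a known relation. The Python sketch below illustrates that general idea only and is not the authors' procedure; the `Mention` type, the `label_sentence` function and all identifiers are hypothetical.

```python
from itertools import product
from typing import NamedTuple


class Mention(NamedTuple):
    """An entity mention in one sentence (hypothetical structure)."""
    text: str
    concept_id: str  # normalized identifier, e.g. a MeSH ID
    kind: str        # "Chemical" or "Disease"


def label_sentence(mentions, known_pairs):
    """Label every chemical-disease pair that co-occurs in one sentence.

    A pair is marked True when it appears in `known_pairs`, a set of
    (chemical_id, disease_id) tuples assumed to come from a curated
    resource, and False otherwise.
    """
    chemicals = sorted({m.concept_id for m in mentions if m.kind == "Chemical"})
    diseases = sorted({m.concept_id for m in mentions if m.kind == "Disease"})
    return [(c, d, (c, d) in known_pairs) for c, d in product(chemicals, diseases)]


# Made-up identifiers, for illustration only.
known_pairs = {("CHEM:1", "DIS:1")}
mentions = [
    Mention("drug A", "CHEM:1", "Chemical"),
    Mention("condition X", "DIS:1", "Disease"),
    Mention("condition Y", "DIS:2", "Disease"),
]
examples = label_sentence(mentions, known_pairs)
```

Labels produced this way are noisy, since co-occurrence in a sentence does not guarantee that the sentence asserts the relation, which is why such a corpus is called "silver" rather than "gold".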
