

Integrating various resources for gene name normalization.

Affiliation

School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, China.

Publication information

PLoS One. 2012;7(9):e43558. doi: 10.1371/journal.pone.0043558. Epub 2012 Sep 12.

Abstract

The recognition and normalization of gene mentions in biomedical literature are crucial steps in biomedical text mining. We present a system for extracting gene names from biomedical literature and normalizing them to gene identifiers in databases. The system consists of four major components: gene name recognition, entity mapping, disambiguation, and filtering. The first component is a gene name recognizer based on dictionary matching and semi-supervised learning, which uses co-occurrence information from a large number of unlabeled MEDLINE abstracts to enhance the feature representation of gene named entities. In the entity mapping stage, we combine exact-match and approximate-match strategies to link gene names in context to the EntrezGene database. For gene names that map to more than one database identifier, we develop a disambiguation method based on semantic similarity derived from the Gene Ontology and MEDLINE abstracts. To remove the noise produced in the previous steps, we design a filtering method based on the confidence scores in the dictionary used for named entity recognition (NER). The system can adjust the trade-off between precision and recall based on the result of filtering. It achieves an F-measure of 83% (precision: 82.5%, recall: 83.5%) on the BioCreative II Gene Normalization (GN) dataset, which is comparable to the current state of the art.
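
The entity mapping and filtering stages described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon, the made-up identifiers, the `normalize`, `map_mention`, `filter_by_confidence`, and `f_measure` helpers, and the thresholds are all assumptions for illustration. The abstract does not say which approximate-match technique is used, so Python's standard `difflib.SequenceMatcher` stands in for it here, and the disambiguation step is left out.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon: normalized gene name -> candidate identifiers.
# The identifiers are placeholders, not real EntrezGene IDs.
LEXICON = {
    "brca1": ["ID_1"],
    "tp53": ["ID_2"],
    "p53": ["ID_2", "ID_3"],  # ambiguous name: would need disambiguation
}


def normalize(name):
    """Lowercase and drop punctuation/whitespace so 'BRCA-1' matches 'brca1'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def map_mention(mention, min_ratio=0.8):
    """Try an exact lexicon lookup first, then fall back to approximate match.

    Returns (candidate_ids, score). The score is 1.0 for an exact match and
    the string-similarity ratio for an approximate one (an assumed scoring).
    """
    key = normalize(mention)
    if key in LEXICON:
        return list(LEXICON[key]), 1.0
    best_key, best_ratio = None, 0.0
    for candidate in LEXICON:
        ratio = SequenceMatcher(None, key, candidate).ratio()
        if ratio > best_ratio:
            best_key, best_ratio = candidate, ratio
    if best_key is not None and best_ratio >= min_ratio:
        return list(LEXICON[best_key]), best_ratio
    return [], 0.0


def filter_by_confidence(scored_ids, threshold):
    """Keep identifiers whose confidence score reaches the threshold.

    Raising the threshold favors precision; lowering it favors recall.
    """
    return [gene_id for gene_id, score in scored_ids if score >= threshold]


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With the precision and recall reported in the abstract, `f_measure(0.825, 0.835)` rounds to 0.83, consistent with the stated F-measure of 83%.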


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdfd/3440407/314c9614bce3/pone.0043558.g001.jpg
