HypertenGene: extracting key hypertension genes from biomedical literature with position and automatically-generated template features.

Affiliation

Department of Computer Science and Engineering, Yuan Ze University, Chung Li, Taiwan, Republic of China.

Publication information

BMC Bioinformatics. 2009 Dec 3;10 Suppl 15(Suppl 15):S9. doi: 10.1186/1471-2105-10-S15-S9.

DOI:10.1186/1471-2105-10-S15-S9

PMID:19958519

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2788360/

Abstract

BACKGROUND

The genetic factors leading to hypertension have been extensively studied, and a large number of research papers have been published on the subject. A primary task for hypertension researchers is locating the key hypertension-related genes in abstracts. However, gathering such information with existing tools is not easy: (1) searching for articles often returns far too many hits to browse through; (2) the search results do not highlight the hypertension-related genes discovered in the abstract; (3) even though some text mining services mark up gene names in the abstract, the key genes investigated in a paper are still not distinguished from other genes. To facilitate information gathering for hypertension researchers, one solution is to extract the key hypertension-related genes in each abstract. Constructing such a system involves three major tasks: (1) gene and hypertension named entity recognition, (2) section categorization, and (3) gene-hypertension relation extraction.

RESULTS

We first compare the retrieval performance achieved by individually adding template features and position features to the baseline system. We then examine the combination of both. Using position features raises the AUC of the baseline system from 0.4936 to 0.8140, an increase of roughly 65%. Adding template features alone, however, yields only a marginal improvement (0.0197). Including both improves the AUC to 0.8184, indicating that the two sets of features are complementary and do not have overlapping effects. We then examine performance in a different domain, diabetes, where the system achieves a satisfactory AUC of 0.83.
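As a reminder of what the AUC figures above measure, the following is a minimal sketch, not the authors' evaluation code. It computes AUC as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The function name `auc` and the toy labels and scores are illustrative assumptions; only the two AUC values in the final lines come from the abstract.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = key gene, 0 = not a key gene.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auc = auc(toy_labels, toy_scores)

# Ratio of the two AUC values reported in the abstract.
position_gain = round(0.8140 / 0.4936, 2)
```

An AUC near 0.5 corresponds to a ranking no better than chance, which puts the reported baseline of 0.4936 in context.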

CONCLUSION

Our approach successfully exploits template features to recognize true hypertension-related gene mentions and position features to distinguish key genes from other related genes. Templates are generated automatically and then checked by biologists, which minimizes labor costs. Our approach integrates the advantages of machine learning models and pattern matching. To the best of our knowledge, this is the first systematic study of extracting hypertension-related genes and the first attempt to create a hypertension-gene relation corpus based on the GAD database. Furthermore, our paper proposes and tests novel features for extracting key hypertension genes, such as relative position, section, and template features, which could also be applied to key-gene extraction for other diseases.
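The abstract does not specify how the position and template features are computed, so the following is only a rough sketch under stated assumptions. Here, relative position is taken as the character offset of the first gene mention divided by the abstract length, and a template is a regular expression applied after the gene and disease mentions are masked with placeholders. The names `relative_position`, `template_feature`, and `TEMPLATES`, the single example template, and the sample sentences are all hypothetical.

```python
import re

# Hypothetical template; the paper's templates are generated automatically
# and checked by biologists, and are not reproduced here.
TEMPLATES = [
    re.compile(r"<GENE> (?:gene )?(?:is|was) associated with <DISEASE>"),
]


def relative_position(abstract: str, mention: str) -> float:
    """Offset of the first mention, scaled to [0, 1) by the abstract length."""
    idx = abstract.find(mention)
    if idx < 0:
        raise ValueError(f"mention {mention!r} not found in abstract")
    return idx / len(abstract)


def template_feature(sentence: str, gene: str, disease: str) -> bool:
    """True if the masked sentence matches any template."""
    masked = sentence.replace(gene, "<GENE>").replace(disease, "<DISEASE>")
    return any(t.search(masked) for t in TEMPLATES)


text = "AGT variants were studied. The AGT gene was associated with hypertension."
early = relative_position(text, "AGT")
matched = template_feature(
    "The AGT gene was associated with hypertension.", "AGT", "hypertension"
)
unmatched = template_feature(
    "AGT was measured in patients without hypertension.", "AGT", "hypertension"
)
```

In a real system, values like these would be passed to a classifier alongside other features rather than used as decisions on their own.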
