

HypertenGene: extracting key hypertension genes from biomedical literature with position and automatically-generated template features.

Affiliation

Department of Computer Science and Engineering, Yuan Ze University, Chung Li, Taiwan, Republic of China.

Publication information

BMC Bioinformatics. 2009 Dec 3;10 Suppl 15(Suppl 15):S9. doi: 10.1186/1471-2105-10-S15-S9.

Abstract

BACKGROUND

The genetic factors leading to hypertension have been extensively studied, and large numbers of research papers have been published on the subject. One of hypertension researchers' primary research tasks is to locate key hypertension-related genes in abstracts. However, gathering such information with existing tools is not easy: (1) Searching for articles often returns far too many hits to browse through. (2) The search results do not highlight the hypertension-related genes discovered in the abstract. (3) Even though some text mining services mark up gene names in the abstract, the key genes investigated in a paper are still not distinguished from other genes. To facilitate the information gathering process for hypertension researchers, one solution would be to extract the key hypertension-related genes in each abstract. Three major tasks are involved in the construction of this system: (1) gene and hypertension named entity recognition, (2) section categorization, and (3) gene-hypertension relation extraction.

RESULTS

We first compared the retrieval performance achieved by adding template features and position features individually to the baseline system, and then examined the combination of both. Position features raised the AUC of the baseline system from 0.4936 to 0.8140, a roughly 1.65-fold increase. Adding template features alone yielded only a marginal improvement (0.0197). Including both raised the AUC to 0.8184, suggesting that the two feature sets are complementary rather than overlapping. We then evaluated the approach in a different domain, diabetes, where it achieved a satisfactory AUC of 0.83.
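To make the reported figures concrete, the following minimal sketch shows one common way to compute an AUC and then derives the relative gains from the numbers quoted above. It assumes the AUC is the area under the ROC curve, which the abstract does not specify; the function name `roc_auc` and the toy labels and scores are illustrative only and do not come from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive is scored above a randomly chosen negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (invented data): 1 = key gene, 0 = other gene mention.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(labels, scores)

# AUC values as reported in the abstract.
reported = {"baseline": 0.4936, "position": 0.8140, "both": 0.8184}
ratio = reported["position"] / reported["baseline"]   # gain from position features
gain = reported["both"] - reported["position"]        # extra gain from adding templates
print(f"toy AUC={auc:.4f}, position/baseline={ratio:.3f}, both-position={gain:.4f}")
```

Under this reading, position features account for nearly all of the improvement, while adding template features on top of them contributes a further 0.0044.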

CONCLUSION

Our approach successfully exploits template features to recognize true hypertension-related gene mentions and position features to distinguish key genes from other related genes. Templates are generated automatically and then checked by biologists, which minimizes labor costs. Our approach integrates the advantages of machine learning models and pattern matching. To the best of our knowledge, this is the first systematic study of extracting hypertension-related genes and the first attempt to create a hypertension-gene relation corpus based on the GAD database. Furthermore, our paper proposes and tests novel features for extracting key hypertension genes, such as relative position, section, and template features, which could also be applied to key-gene extraction for other diseases.
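As a rough illustration of what position and template features might look like, the sketch below computes a relative sentence position and a template-match flag for each sentence that mentions a gene. Everything here is an assumption made for illustration: the abstract does not give the feature definitions, the paper's templates are generated automatically rather than hand-written, and the regular expression, the `features` function, and the example sentences are all hypothetical.

```python
import re

# Hypothetical hand-written template over entity-masked text. The paper's
# templates are generated automatically and are not listed in the abstract.
TEMPLATE = re.compile(r"GENE\b.*\bassociated with\b.*\bDISEASE")


def features(sentences, gene, disease="hypertension"):
    """Return one feature dict per sentence that mentions `gene`."""
    rows = []
    n = len(sentences)
    for i, sent in enumerate(sentences):
        if gene not in sent:
            continue
        # Mask the entities so the template is independent of specific names.
        masked = sent.replace(gene, "GENE").replace(disease, "DISEASE")
        rows.append({
            # 0.0 = first sentence of the abstract, 1.0 = last sentence.
            "relative_position": i / (n - 1) if n > 1 else 0.0,
            "is_first_or_last": i in (0, n - 1),
            "template_match": bool(TEMPLATE.search(masked)),
        })
    return rows


# Invented example abstract, split into sentences.
abstract = [
    "We studied the role of AGT in essential hypertension.",
    "Genotyping was performed in 200 patients.",
    "AGT was significantly associated with hypertension.",
]
rows = features(abstract, "AGT")
for row in rows:
    print(row)
```

In a full system, feature dicts like these would be fed to a classifier alongside the baseline features; the section feature mentioned above would additionally require the label of the abstract section (for example BACKGROUND or RESULTS) in which each sentence occurs.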


