Text Categorization of Heart, Lung, and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing n-grams and Metadata Features.

Author information

Ross Mindy K, Lin Ko-Wei, Truong Karen, Kumar Abhishek, Conway Mike

Affiliations

Department of Pediatrics, Division of Respiratory Medicine, University of California, San Diego, USA; Department of Medicine, Division of Biomedical Informatics, University of California, San Diego, USA.

Publication information

Biomed Inform Insights. 2013 Jul 22;6:35-45. doi: 10.4137/BII.S11987. Print 2013.

Abstract

The Database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contribution to genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, effective use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques to improve study retrieval in the context of the dbGaP database. We utilized standard machine learning algorithms (naive Bayes, support vector machines, and the C4.5 decision tree) trained on dbGaP study text and incorporated n-gram features and study metadata to identify heart, lung, and blood studies. We used the χ² feature selection algorithm to identify features that contributed most to classification performance and experimented with dbGaP-associated PubMed papers as a proxy for topicality. Classifier performance was favorable in comparison to keyword-based search results. We conclude that text categorization is a useful complement to document retrieval techniques in dbGaP.
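
To make the feature-selection step in the abstract concrete, here is a minimal sketch of scoring word n-grams with the χ² statistic for a binary label (for example, "heart, lung, or blood study" versus "other"). It is an illustration only, not the authors' implementation: the abstract does not say which software was used, the function names `ngrams` and `chi2_scores` are hypothetical, and the four toy study descriptions are invented for the example.

```python
from collections import Counter

def ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) in a lowercased text."""
    tokens = text.lower().split()
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams

def chi2_scores(docs, labels, n_max=2):
    """Score each n-gram with the chi-squared statistic of its 2x2 table
    (n-gram present/absent vs. positive/negative class)."""
    total = len(docs)
    n_pos = sum(labels)
    n_neg = total - n_pos
    pos_counts, neg_counts = Counter(), Counter()
    for text, label in zip(docs, labels):
        # Count document frequency, not term frequency: each n-gram once per doc.
        (pos_counts if label else neg_counts).update(ngrams(text, n_max))
    scores = {}
    for gram in set(pos_counts) | set(neg_counts):
        a = pos_counts[gram]   # positive docs containing the n-gram
        b = neg_counts[gram]   # negative docs containing the n-gram
        c = n_pos - a          # positive docs without it
        d = n_neg - b          # negative docs without it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

# Invented toy data (assumption): 1 = heart/lung/blood study, 0 = other.
docs = [
    "asthma lung function cohort",
    "lung cancer genome study",
    "breast cancer genome study",
    "colon cancer cohort",
]
labels = [1, 1, 0, 0]

scores = chi2_scores(docs, labels)
for gram in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(f"{gram}: {scores[gram]:.2f}")
```

In this toy set the unigram "lung" appears in every positive description and in no negative one, so it receives the highest score; the top-scoring n-grams would then be passed as features to a classifier such as naive Bayes or a support vector machine.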

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ce5e/3728208/fcd7a1f9d6c3/bii-6-2013-035f1.jpg
