Text Categorization of Heart, Lung, and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing n-grams and Metadata Features.

Author information

Ross Mindy K, Lin Ko-Wei, Truong Karen, Kumar Abhishek, Conway Mike

Affiliations

Department of Pediatrics, Division of Respiratory Medicine, University of California, San Diego, USA; Department of Medicine, Division of Biomedical Informatics, University of California, San Diego, USA.

Publication information

Biomed Inform Insights. 2013 Jul 22;6:35-45. doi: 10.4137/BII.S11987. Print 2013.

DOI:10.4137/BII.S11987

PMID:23926434

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3728208/

Abstract

The Database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contribution to genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, effective use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques to improve study retrieval in the context of the dbGaP database. We utilized standard machine learning algorithms (naive Bayes, support vector machines, and the C4.5 decision tree) trained on dbGaP study text and incorporated n-gram features and study metadata to identify heart, lung, and blood studies. We used the χ² feature selection algorithm to identify features that contributed most to classification performance and experimented with dbGaP-associated PubMed papers as a proxy for topicality. Classifier performance was favorable in comparison to keyword-based search results. It was determined that text categorization is a useful complement to document retrieval techniques in the dbGaP.
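To make the feature selection step concrete, the following is a minimal sketch of χ² scoring over word n-gram features for a binary task (heart, lung, and blood study versus other). It is not the authors' implementation: the function names `ngrams` and `chi2_scores`, the whitespace tokenization, and the four toy study descriptions are assumptions made purely for illustration.

```python
from collections import Counter


def ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) present in a text."""
    tokens = text.lower().split()
    feats = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats


def chi2_scores(docs, labels, n_max=2):
    """Score each n-gram with the chi-squared statistic of its 2x2
    (feature present/absent) x (positive/negative class) table."""
    n_docs = len(docs)
    n_pos = sum(labels)
    n_neg = n_docs - n_pos
    pos_counts, neg_counts = Counter(), Counter()
    for text, label in zip(docs, labels):
        (pos_counts if label else neg_counts).update(ngrams(text, n_max))
    scores = {}
    for feat in set(pos_counts) | set(neg_counts):
        a = pos_counts[feat]  # positive documents containing the feature
        b = neg_counts[feat]  # negative documents containing the feature
        c = n_pos - a         # positive documents lacking the feature
        d = n_neg - b         # negative documents lacking the feature
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A feature present in every document has a zero denominator;
        # it carries no class information, so score it 0.
        scores[feat] = n_docs * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Hypothetical toy corpus: 1 = heart/lung/blood study, 0 = other.
docs = [
    "heart disease cohort study",
    "heart failure family study",
    "breast cancer cohort study",
    "eye disease family study",
]
labels = [1, 1, 0, 0]

scores = chi2_scores(docs, labels)
top = sorted(scores, key=scores.get, reverse=True)[:5]
for feat in top:
    print(f"{feat}\t{scores[feat]:.3f}")
```

In this toy corpus the unigram "heart" separates the two classes perfectly and receives the highest score, while "study", which occurs in every document, scores zero. In a real pipeline the top-ranked features would then be passed to a classifier such as naive Bayes or a support vector machine.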
