Comparison of vocabularies, representations and ranking algorithms for gene prioritization by text mining.

Authors

Yu Shi, Van Vooren Steven, Tranchevent Leon-Charles, De Moor Bart, Moreau Yves

Affiliation

Department of Electrical Engineering, Bioinformatics Group, SCD, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium.

Publication information

Bioinformatics. 2008 Aug 15;24(16):i119-25. doi: 10.1093/bioinformatics/btn291.

DOI:10.1093/bioinformatics/btn291

PMID:18689812

Abstract

MOTIVATION

Computational gene prioritization methods help identify susceptibility genes that may be involved in genetic disease. Recently, text mining techniques have been applied to extract prior knowledge from text-based genomic information sources, and this knowledge can be used to improve the prioritization process. However, the effect of different vocabularies, representations and ranking algorithms on text mining for gene prioritization has not yet been studied systematically and comparatively. This article therefore presents a benchmark study of the vocabularies, representations and ranking algorithms used in gene prioritization by text mining.

RESULTS

We investigated 5 different domain vocabularies, 2 text representation schemes and 27 linear ranking algorithms for disease gene prioritization by text mining. We indexed 288 177 MEDLINE titles and abstracts with the TXTGate text profiling system and adapted the benchmark dataset of the Endeavour gene prioritization system, which consists of 618 disease-causing genes. Textual gene profiles were created, and their prioritization performance was evaluated and compared. The results show that the inverse document frequency (IDF)-based representation of gene term vectors performs better than the term frequency-inverse document frequency (TF-IDF) representation. The eVOC and MeSH domain vocabularies perform better than those of Gene Ontology, Online Mendelian Inheritance in Man and the London Dysmorphology Database. The ranking algorithms based on 1-SVM, Standard Correlation and the Ward linkage method provide the best performance.
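The difference between the two representations compared above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code or the TXTGate implementation: the toy documents, the four-term vocabulary, the gene names and all function names are hypothetical, and the assumptions are that a gene profile is the average of the vectors of the documents linked to that gene and that candidates are scored by Pearson correlation with the mean training profile (a simple stand-in for a correlation-based ranker).

```python
import math
from collections import Counter

def idf_weights(docs, vocabulary):
    """Inverse document frequency of each vocabulary term over tokenized docs."""
    n_docs = len(docs)
    weights = {}
    for term in vocabulary:
        df = sum(1 for doc in docs if term in doc)
        # Terms absent from the corpus get weight 0 (an illustrative choice).
        weights[term] = math.log(n_docs / df) if df else 0.0
    return weights

def doc_vector(doc, vocabulary, idf, use_tf):
    """IDF vector (use_tf=False) or TF-IDF vector (use_tf=True) of one document."""
    counts = Counter(doc)
    vector = []
    for term in vocabulary:
        if counts[term] == 0:
            vector.append(0.0)
        else:
            tf = counts[term] if use_tf else 1
            vector.append(tf * idf[term])
    return vector

def gene_profile(doc_ids, docs, vocabulary, idf, use_tf):
    """Assumed profile: average of the vectors of the documents linked to a gene."""
    vectors = [doc_vector(docs[i], vocabulary, idf, use_tf) for i in doc_ids]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_candidates(training, candidates):
    """Order candidate genes by correlation with the mean training profile."""
    centroid = [sum(col) / len(training) for col in zip(*training.values())]
    scores = {gene: pearson(p, centroid) for gene, p in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy corpus: each document is a list of vocabulary terms.
vocabulary = ["cardiac", "muscle", "kinase", "retina"]
docs = [
    ["cardiac", "muscle", "cardiac"],
    ["muscle", "kinase"],
    ["retina", "kinase"],
    ["cardiac", "muscle"],
]
gene_docs = {"GENE_A": [0], "GENE_B": [1], "GENE_C": [2], "GENE_D": [3]}

idf = idf_weights(docs, vocabulary)
idf_profiles = {g: gene_profile(ids, docs, vocabulary, idf, use_tf=False)
                for g, ids in gene_docs.items()}
tfidf_profiles = {g: gene_profile(ids, docs, vocabulary, idf, use_tf=True)
                  for g, ids in gene_docs.items()}

# GENE_A plays the role of a known disease gene; the rest are candidates.
training = {"GENE_A": idf_profiles["GENE_A"]}
candidates = {g: p for g, p in idf_profiles.items() if g != "GENE_A"}
ranking = rank_candidates(training, candidates)
print(ranking)
```

In this sketch the only difference between the two schemes is whether a term's weight is multiplied by its count in the document. Swapping `idf_profiles` for `tfidf_profiles` when building `training` and `candidates` gives the TF-IDF variant of the same ranking.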

AVAILABILITY

The MATLAB code of the algorithm and the benchmark datasets are available on request.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

Inter-species normalization of gene mentions with GNAT.

Bioinformatics. 2008 Aug 15;24(16):i126-132. doi: 10.1093/bioinformatics/btn299.

Combination of text-mining algorithms increases the performance.

Bioinformatics. 2006 Sep 1;22(17):2151-7. doi: 10.1093/bioinformatics/btl281. Epub 2006 Jun 9.

Gene symbol disambiguation using knowledge-based profiles.

Bioinformatics. 2007 Apr 15;23(8):1015-22. doi: 10.1093/bioinformatics/btm056. Epub 2007 Feb 21.

Kernel-based data fusion for gene prioritization.

Bioinformatics. 2007 Jul 1;23(13):i125-32. doi: 10.1093/bioinformatics/btm187.

Facts from text: can text mining help to scale-up high-quality manual curation of gene products with ontologies?

Brief Bioinform. 2008 Nov;9(6):466-78. doi: 10.1093/bib/bbn043. Epub 2008 Dec 6.

Integrative mining of traditional Chinese medicine literature and MEDLINE for functional gene networks.

Artif Intell Med. 2007 Oct;41(2):87-104. doi: 10.1016/j.artmed.2007.07.007. Epub 2007 Sep 5.

A quantitative model for linking two disparate sets of articles in MEDLINE.

Bioinformatics. 2007 Jul 1;23(13):1658-65. doi: 10.1093/bioinformatics/btm161. Epub 2007 Apr 26.

Text mining.

Methods Mol Biol. 2008;453:471-91. doi: 10.1007/978-1-60327-429-6_25.

PuReD-MCL: a graph-based PubMed document clustering methodology.

Bioinformatics. 2008 Sep 1;24(17):1935-41. doi: 10.1093/bioinformatics/btn318. Epub 2008 Jul 1.

Cited by

Predicting disease-related genes using integrated biomedical networks.

BMC Genomics. 2017 Jan 25;18(Suppl 1):1043. doi: 10.1186/s12864-016-3263-4.

Text mining applications in psychiatry: a systematic literature review.

Int J Methods Psychiatr Res. 2016 Jun;25(2):86-100. doi: 10.1002/mpr.1481. Epub 2015 Jul 17.

A random set scoring model for prioritization of disease candidate genes using protein complexes and data-mining of GeneRIF, OMIM and PubMed records.

BMC Bioinformatics. 2014 Sep 24;15(1):315. doi: 10.1186/1471-2105-15-315.

An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods.

Artif Intell Med. 2014 Jun;61(2):63-78. doi: 10.1016/j.artmed.2014.03.003. Epub 2014 Mar 20.

The Growing Importance of CNVs: New Insights for Detection and Clinical Interpretation.

Front Genet. 2013 May 30;4:92. doi: 10.3389/fgene.2013.00092. eCollection 2013.

Disease gene identification by random walk on multigraphs merging heterogeneous genomic and phenotype data.

BMC Genomics. 2012;13 Suppl 7(Suppl 7):S27. doi: 10.1186/1471-2164-13-S7-S27. Epub 2012 Dec 13.

Inferring novel gene-disease associations using Medical Subject Heading Over-representation Profiles.

Genome Med. 2012 Sep 28;4(9):75. doi: 10.1186/gm376. eCollection 2012.

Caipirini: using gene sets to rank literature.

BioData Min. 2012 Feb 1;5(1):1. doi: 10.1186/1756-0381-5-1.

BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs.

BMC Bioinformatics. 2011 Apr 21;12:112. doi: 10.1186/1471-2105-12-112.

Improving disease gene prioritization using the semantic similarity of Gene Ontology terms.

Bioinformatics. 2010 Sep 15;26(18):i561-7. doi: 10.1093/bioinformatics/btq384.
