Yu Shi, Van Vooren Steven, Tranchevent Leon-Charles, De Moor Bart, Moreau Yves
Department of Electrical Engineering, Bioinformatics Group, SCD, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium.
Bioinformatics. 2008 Aug 15;24(16):i119-25. doi: 10.1093/bioinformatics/btn291.
Computational gene prioritization methods help identify susceptibility genes potentially involved in genetic disease. Recently, text mining techniques have been applied to extract prior knowledge from text-based genomic information sources, and this knowledge can be used to improve the prioritization process. However, the effect of different vocabularies, representations and ranking algorithms on text mining for gene prioritization still requires systematic, comparative study. This article therefore presents a benchmark study of vocabularies, representations and ranking algorithms for gene prioritization by text mining.
We investigated 5 different domain vocabularies, 2 text representation schemes and 27 linear ranking algorithms for disease gene prioritization by text mining. We indexed 288 177 MEDLINE titles and abstracts with the TXTGate text profiling system and adapted the benchmark dataset of the Endeavour gene prioritization system, which consists of 618 disease-causing genes. Textual gene profiles were created, and their prioritization performance was evaluated and compared. The results show that the inverse document frequency-based representation of gene term vectors performs better than the term-frequency inverse document-frequency representation. The eVOC and MESH domain vocabularies perform better than those of Gene Ontology, Online Mendelian Inheritance in Man and the London Dysmorphology Database. The ranking algorithms based on 1-SVM, Standard Correlation and the Ward linkage method provide the best performance.
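The two representation schemes and a correlation-based ranking can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code or the TXTGate pipeline. The function names, the toy corpus, the averaging of document vectors into a gene profile, the unit-length normalization and the use of Pearson correlation against the centroid of the training genes are all assumptions made for illustration; the 1-SVM and Ward linkage rankers are not shown.

```python
import math
from collections import Counter


def doc_vectors(docs, vocabulary, scheme="idf"):
    """Turn tokenized documents into term vectors over a fixed vocabulary.

    scheme="idf":   weight is idf(t) if the term occurs in the document, else 0.
    scheme="tfidf": weight is (raw term count) * idf(t).
    """
    n_docs = len(docs)
    counts = [Counter(t for t in doc if t in vocabulary) for doc in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocabulary if doc_freq[t] > 0}
    vectors = []
    for c in counts:
        if scheme == "idf":
            vec = [idf.get(t, 0.0) if c[t] > 0 else 0.0 for t in vocabulary]
        elif scheme == "tfidf":
            vec = [c[t] * idf.get(t, 0.0) for t in vocabulary]
        else:
            raise ValueError("scheme must be 'idf' or 'tfidf'")
        vectors.append(vec)
    return vectors


def gene_profile(vectors, doc_ids):
    """Average the vectors of the documents linked to a gene (assumed scheme)
    and scale the result to unit length."""
    n_terms = len(vectors[0])
    mean = [sum(vectors[d][j] for d in doc_ids) / len(doc_ids) for j in range(n_terms)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean


def pearson(u, v):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v) if sd_u > 0 and sd_v > 0 else 0.0


def rank_candidates(training_profiles, candidate_profiles):
    """Rank candidate genes by correlation with the centroid of the training genes."""
    n_terms = len(training_profiles[0])
    centroid = [
        sum(p[j] for p in training_profiles) / len(training_profiles)
        for j in range(n_terms)
    ]
    scores = {g: pearson(centroid, p) for g, p in candidate_profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus (hypothetical): five tokenized abstracts and a four-term vocabulary.
vocabulary = ["heart", "muscle", "retina", "vision"]
docs = [
    ["heart", "muscle", "heart"],
    ["heart", "muscle"],
    ["retina", "vision"],
    ["retina", "vision", "vision"],
    ["heart", "retina"],
]
vectors = doc_vectors(docs, vocabulary, scheme="idf")
training = [gene_profile(vectors, [0])]           # one known disease gene
candidates = {
    "G2": gene_profile(vectors, [1]),             # shares the training gene's terms
    "G3": gene_profile(vectors, [2, 3]),          # unrelated terms
}
for gene, score in rank_candidates(training, candidates):
    print(gene, round(score, 3))
```

In this sketch the only difference between the two schemes is whether the within-document term count multiplies the IDF weight. Swapping `scheme="idf"` for `scheme="tfidf"` is enough to compare them on the same corpus.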
The MATLAB code of the algorithm and the benchmark datasets are available on request.
Supplementary data are available at Bioinformatics online.