Steele E, Tucker A, 't Hoen P A C, Schuemie M J
Centre for Intelligent Data Analysis, School of Information Systems, Computing and Mathematics, Brunel University, Uxbridge UB8 3PH, UK.
Bioinformatics. 2009 Jul 15;25(14):1768-74. doi: 10.1093/bioinformatics/btp277. Epub 2009 Apr 23.
The use of prior knowledge to improve gene regulatory network modelling has often been proposed. In this article, we present the first study of the large-scale incorporation of prior knowledge from the literature into Bayesian network learning of gene networks. Online databases have been proposed as a source of prior knowledge in past research, but keeping them up to date becomes increasingly challenging as the publication rate of scientific papers grows. The novelty of our approach lies in the use of gene-pair association scores that describe the overlap in the contexts in which the genes are mentioned. These scores are generated from a large database of scientific literature and condense the information contained in a huge number of documents into a simple, clear format.
We present a method to transform such literature-based gene association scores into network prior probabilities, and apply it to learn gene sub-networks for yeast, Escherichia coli and human. We also investigate the effect of weighting the influence of the prior knowledge. Our findings show that, in comparison to a network learnt solely from expression data, literature-based priors can improve both the number of true regulatory interactions present in the network and the accuracy of expression value prediction for genes. Networks learnt with priors are also easier to interpret biologically, with identified sub-networks that coincide with known biological pathways.
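The abstract does not give the actual transformation, so the following is only a minimal sketch of how gene-pair association scores could be turned into edge prior probabilities and then into a log prior over network structures. The min-max scaling, the shrinkage toward 0.5 controlled by `weight`, the function names and the example scores are all assumptions made for illustration, not the authors' method.

```python
import math
from itertools import combinations


def scores_to_edge_priors(scores, weight=1.0):
    """Map gene-pair association scores to edge prior probabilities.

    ASSUMPTION: scores are min-max scaled to [0, 1] and then shrunk toward
    the uninformative value 0.5. weight=0 ignores the literature entirely;
    weight=1 uses the scaled scores as they are.
    """
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    priors = {}
    for pair, score in scores.items():
        p = (score - lo) / span
        p = 0.5 + weight * (p - 0.5)
        # Clip so that log(p) and log(1 - p) stay finite.
        priors[frozenset(pair)] = min(max(p, 1e-6), 1.0 - 1e-6)
    return priors


def log_structure_prior(edges, genes, priors):
    """Log prior of a network: each gene pair contributes independently.

    ASSUMPTION: edge direction is ignored, and pairs without a score get
    the uninformative probability 0.5.
    """
    present = {frozenset(edge) for edge in edges}
    total = 0.0
    for a, b in combinations(genes, 2):
        pair = frozenset((a, b))
        p = priors.get(pair, 0.5)
        total += math.log(p if pair in present else 1.0 - p)
    return total


# Hypothetical scores for three yeast genes (illustrative values only).
example_scores = {
    ("GAL4", "GAL80"): 0.9,
    ("GAL4", "ACT1"): 0.1,
    ("GAL80", "ACT1"): 0.3,
}
example_genes = ["GAL4", "GAL80", "ACT1"]

if __name__ == "__main__":
    for w in (0.0, 0.5, 1.0):
        pri = scores_to_edge_priors(example_scores, weight=w)
        lp = log_structure_prior([("GAL4", "GAL80")], example_genes, pri)
        print(f"weight={w}: log prior of network with GAL4-GAL80 edge = {lp:.3f}")
```

In a score-based structure search, a log prior of this kind would be added to the log marginal likelihood of the expression data, so that `weight` controls how strongly the literature steers the search relative to the data.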