Fröhlich Holger, Fellmann Mark, Sültmann Holger, Poustka Annemarie, Beissbarth Tim
German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69120 Heidelberg, Germany.
Bioinformatics. 2008 Oct 1;24(19):2137-42. doi: 10.1093/bioinformatics/btn403. Epub 2008 Aug 1.
Functional characterization of genes is of great importance for understanding complex cellular processes. Valuable information for this purpose can be obtained from pathway databases such as KEGG. However, only a small fraction of genes has been annotated with pathway information so far. In contrast, information on the protein domains a gene product contains is available for a significantly larger number of genes, e.g. from the InterPro database.
We present a classification model that, for a specific gene of interest, predicts its mapping to a KEGG pathway based on its domain signature. The classifier makes explicit use of the hierarchical organization of pathways in the KEGG database. Furthermore, we take into account that a specific gene can be mapped to several pathways at the same time. The classification method produces a score for every possible mapping position of the gene in the KEGG hierarchy. Evaluations of our model, which combines an SVM with a ranking perceptron approach, show high prediction performance. Moreover, for signaling pathways we show that it is even possible to accurately predict membership in individual pathway components.
The R package gene2pathway is a supplement to this article.
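To make the idea of scoring every position in a pathway hierarchy from a domain signature more concrete, the following is a minimal sketch in Python. It is not the gene2pathway implementation and does not reproduce the SVM component: it only illustrates a simple pairwise ranking perceptron, under the assumption that a gene is represented as a set of domain identifiers and that each annotated pathway also implies its ancestors in the hierarchy. The hierarchy, the domain identifiers (DOM_A and so on), the training examples, and all function names are hypothetical.

```python
# Toy illustration only: the hierarchy, domain IDs, examples, and update rule
# are assumptions made for this sketch, not the published method.

# Hypothetical two-level hierarchy: node -> parent (None marks a top-level node).
PARENT = {
    "Metabolism": None,
    "Signal transduction": None,
    "Genetic information processing": None,
    "Glycolysis": "Metabolism",
    "MAPK signaling": "Signal transduction",
    "Ribosome": "Genetic information processing",
}

# Hypothetical training data: (domain signature, annotated pathways).
# The third gene maps to two pathways at once.
EXAMPLES = [
    ({"DOM_A", "DOM_B"}, {"Glycolysis"}),
    ({"DOM_C"}, {"MAPK signaling"}),
    ({"DOM_A", "DOM_C"}, {"Glycolysis", "MAPK signaling"}),
    ({"DOM_D"}, {"Ribosome"}),
]


def with_ancestors(labels):
    """Expand a set of pathway labels with all their ancestors."""
    expanded = set()
    for node in labels:
        while node is not None:
            expanded.add(node)
            node = PARENT[node]
    return expanded


def score(weights, domains):
    """Return a score for every hierarchy node given a set of domain IDs."""
    return {
        node: sum(w.get(d, 0.0) for d in domains)
        for node, w in weights.items()
    }


def train(examples, epochs=10):
    """Pairwise ranking perceptron: every correct node should outscore
    every incorrect node; violated pairs trigger an additive update."""
    weights = {node: {} for node in PARENT}
    for _ in range(epochs):
        for domains, labels in examples:
            positives = with_ancestors(labels)
            negatives = PARENT.keys() - positives
            current = score(weights, domains)
            for pos in positives:
                for neg in negatives:
                    if current[pos] <= current[neg]:  # ranking violated
                        for d in domains:
                            weights[pos][d] = weights[pos].get(d, 0.0) + 1.0
                            weights[neg][d] = weights[neg].get(d, 0.0) - 1.0
    return weights


weights = train(EXAMPLES)
scores = score(weights, {"DOM_A", "DOM_C"})
# Print all hierarchy positions from highest to lowest score.
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, scores[node])
```

The output is a ranked list of all nodes rather than a single label, which is one way to accommodate genes that belong to more than one pathway. In this sketch the parent and child scores are learned independently, so nothing forces them to agree; a method that exploits the hierarchy explicitly, as the abstract describes, would couple them.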