Toronto Health Economics and Technology Assessment (THETA) Collaborative, University of Toronto, Toronto, Canada.
BioData Min. 2013 Apr 2;6(1):8. doi: 10.1186/1756-0381-6-8.
While the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein-encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes for which there is no annotation. Large gene expression datasets from microarray experiments already exist, and many of these can be used to help assign potential functions to these genes. We applied Support Vector Machines (SVM), a sigmoid fitting function and a stratified cross-validation approach to analyze a large microarray dataset from Drosophila melanogaster in order to predict possible functions for previously un-annotated genes. Approximately 5043 different genes, or about one-third of the predicted genes in the D. melanogaster genome, are represented in the dataset, and 1854 (37%) of these genes are un-annotated.
We found 39 Gene Ontology Biological Process (GO-BP) categories with a precision of 0.75 or higher when recall was fixed at 0.4. For two of those categories, we provide additional support for the gene assignments by showing that the majority of transcripts for genes in the category have a similar localization pattern during embryogenesis. Additionally, by assessing the predictions using a confidence score, we were able to provide a putative GO-BP term for 1422 previously un-annotated genes, which is about 77% of the un-annotated genes represented on the microarray and about 19% of all un-annotated genes in the D. melanogaster genome.
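The "precision at fixed recall" criterion can be illustrated with a minimal sketch. It assumes each gene has a classifier score and a binary label for one GO-BP category, and it ignores tied scores. The function name `precision_at_recall` and the toy data are hypothetical and are not taken from the study.

```python
def precision_at_recall(scores, labels, target_recall=0.4):
    """Precision at the shortest ranked list whose recall reaches target_recall.

    scores: classifier confidence per gene (higher = more likely in category)
    labels: 1 if the gene is annotated to the category, else 0
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank genes from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    true_pos = 0
    for k, (_, y) in enumerate(ranked, start=1):
        true_pos += y
        # Stop as soon as the top-k list recovers enough of the positives.
        if true_pos / total_pos >= target_recall:
            return true_pos / k
    return true_pos / len(ranked)


# Toy example (illustrative values only): 4 positives among 6 genes.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 0, 1, 0, 1, 1]
toy_precision = precision_at_recall(toy_scores, toy_labels, target_recall=0.5)
```

Under this reading, a category passes the reported filter if the returned precision is at least 0.75 when `target_recall` is 0.4.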
Our study successfully employs a number of SVM classifiers, accompanied by detailed calibration and validation techniques, to generate predictions of new annotations for D. melanogaster genes. Applying probabilistic analysis to the SVM output improves the interpretability of the prediction results and the objectivity of the validation procedure.
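As a rough illustration of how a sigmoid fit can turn raw SVM decision values into probabilities, the sketch below fits P(y = 1 | f) = 1 / (1 + exp(A f + B)) by plain gradient descent on the negative log-likelihood. This is a simplified, Platt-style stand-in, not necessarily the fitting procedure used in the study. The function names, learning rate, iteration count and toy data are all assumptions.

```python
import math


def sigmoid_prob(f, a, b):
    """Map a decision value f to a probability 1 / (1 + exp(a*f + b))."""
    z = a * f + b
    # Numerically stable evaluation for large |z|.
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def fit_sigmoid(decision_values, labels, lr=0.1, n_iter=5000):
    """Fit (a, b) by gradient descent on the negative log-likelihood."""
    a, b = -1.0, 0.0  # a < 0 means higher decision value -> higher probability
    n = len(labels)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for f, y in zip(decision_values, labels):
            p = sigmoid_prob(f, a, b)
            # For z = a*f + b, d(NLL)/dz = y - p.
            grad_a += (y - p) * f
            grad_b += y - p
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy, non-separable decision values (illustrative only).
toy_f = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
toy_y = [0, 0, 1, 0, 1, 1]
fit_a, fit_b = fit_sigmoid(toy_f, toy_y)
toy_probs = [sigmoid_prob(f, fit_a, fit_b) for f in toy_f]
```

In practice the sigmoid would be fitted on held-out decision values, for example from the cross-validation folds, so that the calibrated probabilities are not biased by the training data.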