Nanni Loris, Lumini Alessandra
DEIS, IEIIT-CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy.
Amino Acids. 2008 May;34(4):635-41. doi: 10.1007/s00726-007-0016-3. Epub 2008 Jan 4.
Given a novel protein, it is important to know whether it is a DNA-binding protein, because DNA-binding proteins play a fundamental role in regulating gene expression. In this work, we propose a parallel fusion of a classifier trained on features extracted from the Gene Ontology (GO) database and a classifier trained on the dipeptide composition of the protein. The classifiers used are the support vector machine (SVM) and the 1-nearest neighbour classifier. The Matthews correlation coefficient obtained by our fusion method is approximately 0.97 under jackknife cross-validation; this outperforms the best result reported in the literature on the same dataset (0.924), where the SVM is trained using only Chou's pseudo amino acid composition features. We also report the area under the ROC curve (AUC): the fusion reaches an AUC of 0.995. In particular, our fusion obtains a false negative rate of 5% at a false positive rate of 0%. The Matthews correlation coefficient obtained using the single best GO number is only 0.7211, so the Gene Ontology database cannot be used as a simple lookup table. Finally, we test the complementarity of the two feature extraction methods using the Q-statistic. The resulting value of 0.58 indicates that the features extracted from the Gene Ontology database and the features extracted from the amino acid sequence are partially independent, and that their parallel fusion merits further study.
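The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the fusion rule (a plain average of two classifier scores, assumed to lie on a common [0, 1] scale) is an assumption because the abstract does not say how the classifiers are combined, and the Q-statistic follows the standard Yule definition computed from the correctness of two classifiers' predictions.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400 dipeptide frequencies of a protein sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def sum_rule_fusion(scores_a, scores_b):
    """Average two classifiers' scores (assumed fusion rule, same scale)."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


def matthews_cc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def q_statistic(y_true, pred_a, pred_b):
    """Yule's Q between two classifiers, based on which samples each gets right.

    Q is 0 for statistically independent classifiers, and approaches 1 when
    both tend to be right or wrong on the same samples.
    """
    n11 = n00 = n10 = n01 = 0
    for t, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = a == t, b == t
        if a_ok and b_ok:
            n11 += 1
        elif not a_ok and not b_ok:
            n00 += 1
        elif a_ok:
            n10 += 1
        else:
            n01 += 1
    denom = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / denom if denom else 0.0
```

Under this definition, the reported Q-statistic of 0.58 sits between 0 (independent errors) and 1 (fully overlapping errors), which is consistent with the abstract's reading that the two feature sets are partially independent.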