Nanni Loris, Lumini Alessandra
DEIS, IEIIT-CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy.
Amino Acids. 2009 Feb;36(2):167-75. doi: 10.1007/s00726-008-0044-7. Epub 2008 Feb 21.
It is well established in the literature that an ensemble of classifiers often outperforms a stand-alone method. It is therefore important to develop ensemble methods well suited to bioinformatics data. In this work, we propose combining the feature extraction method based on grouped weight with a set of amino-acid alphabets obtained by a genetic algorithm. The proposed method is applied to the prediction of DNA-binding proteins. Two classifiers are tested: the linear support vector machine and the radial basis function support vector machine. Accuracy and Matthews's correlation coefficient are reported as performance indicators. With jackknife cross-validation, our ensemble method obtains a Matthews's correlation coefficient of approximately 0.97. This result exceeds the performance reported in the literature on the same dataset when the features are extracted directly from the amino-acid sequence.
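For readers unfamiliar with the two performance indicators named above, the following is a minimal sketch of how accuracy and Matthews's correlation coefficient can be computed from binary predictions. The function names and the toy labels are illustrative assumptions and are not taken from the paper; 1 stands for "DNA-binding" and 0 for "non-binding".

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def matthews_cc(tp, tn, fp, fn):
    """Matthews's correlation coefficient, in [-1, 1].

    Returns 0.0 when the denominator vanishes (a common convention,
    assumed here rather than taken from the paper).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels, invented purely for illustration (not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
mcc = matthews_cc(*counts)
print(f"accuracy = {acc:.2f}, MCC = {mcc:.2f}")
```

Unlike accuracy, the coefficient uses all four cells of the confusion matrix, so it stays informative when the two classes are of unequal size. Under jackknife cross-validation, each prediction in `y_pred` would come from a model trained on all samples except the one being predicted.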