He Jieyue, Hu Hae-Jin, Harrison Robert, Tai Phang C, Pan Yi
Computer Science and Engineering Department, Southeast University, Nanjing 210096, China.
IEEE Trans Nanobioscience. 2006 Mar;5(1):46-53. doi: 10.1109/tnb.2005.864021.
Support vector machines (SVMs) have shown strong generalization ability in a number of application areas, including protein structure prediction. However, their poor comprehensibility hinders the success of SVMs in protein structure prediction. An explanation of how a decision is made is important for the acceptance of machine learning technology, especially in applications such as bioinformatics. A reasonable interpretation is useful for guiding "wet experiments," and the extracted rules help integrate computational intelligence with symbolic AI systems for advanced deduction. Decision trees, by contrast, have good comprehensibility. This paper presents a novel approach to rule generation for protein secondary structure prediction that integrates the merits of both the SVM and the decision tree. The approach combines the SVM with a decision tree into a new algorithm called SVM_DT, which proceeds in three steps. First, an SVM is trained. Then, a new training set is generated through careful selection from the output of the SVM. Finally, the resulting training set is used to train a decision tree learning system and to extract the corresponding rule sets. Experiments on protein secondary structure prediction with the RS126 data set show that the comprehensibility of SVM_DT is much better than that of the SVM. Moreover, the generalization ability of SVM_DT is better than that of C4.5 decision trees and similar to that of the SVM. Hence, SVM_DT can be used not only for prediction, but also for guiding biological experiments.
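The three steps of SVM_DT can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, a binary toy problem on synthetic two-feature data rather than RS126, and scikit-learn's CART tree in place of C4.5. The abstract does not state how samples are selected from the SVM output, so the selection rule here (keep samples the SVM classifies correctly with a decision value of magnitude at least `margin`) is an assumption, as are the names `svm_dt` and `margin`.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text


def svm_dt(X, y, margin=0.5, max_depth=4, seed=0):
    """Hypothetical SVM -> decision tree pipeline (binary labels only)."""
    # Step 1: train an SVM on the original training set.
    svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)

    # Step 2: build a new training set from the SVM output.
    # ASSUMPTION: keep samples that the SVM classifies correctly and
    # that lie at least `margin` away from the decision boundary.
    scores = svm.decision_function(X)
    pred = svm.predict(X)
    keep = (pred == y) & (np.abs(scores) >= margin)

    # Step 3: train a decision tree on the selected samples, labeled
    # with the SVM predictions, so that rules can be read off the tree.
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=seed)
    tree.fit(X[keep], pred[keep])
    return svm, tree, keep


# Synthetic stand-in data: two Gaussian blobs, 100 samples per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

svm, tree, keep = svm_dt(X, y)

# Human-readable rules extracted from the tree.
rules = export_text(tree, feature_names=["x1", "x2"])
print(rules)

# Fidelity: how often the tree agrees with the SVM on the full set.
print("fidelity to SVM:", tree.score(X, svm.predict(X)))
```

Training the tree on SVM-labeled, high-confidence samples is one plausible reading of "careful selection"; the printed fidelity gives a rough indication of how closely the extracted rules track the SVM they summarize.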