Bioinformatics Institute, 30 Biopolis Street, #07-01 Matrix, Singapore.
IEEE/ACM Trans Comput Biol Bioinform. 2011 May-Jun;8(3):858-64. doi: 10.1109/TCBB.2010.16.
Although numerous computational techniques have been applied to predict protein secondary structure (PSS), only a few studies have addressed the discovery of the logic rules underlying the prediction itself. Such rules link the prediction model to the underlying biology. They also make PSS prediction more interpretable by lending some transparency to a predictive model that is usually regarded as a black box. In this paper, we explore the generation and use of C4.5 decision trees to extract relevant rules from PSS predictions modeled with two-stage support vector machines (TS-SVM). The rules were derived on the RS126 data set of 126 nonhomologous globular proteins and on the PSIPRED data set of 1,923 protein sequences. Our approach produced sets of comprehensible, and often interpretable, rules underlying the PSS predictions, many of which appear to be strongly supported by biological evidence. It also yielded good prediction accuracy, few and usually compact rules, and rules that generally have higher confidence levels than those generated by other rule extraction techniques.
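The general idea of reading rules off a black-box predictor can be illustrated with a minimal sketch. It is not the authors' pipeline: `black_box` is a hypothetical stand-in for a trained TS-SVM, the three-residue windows are invented toy data, and only a single split is chosen (by gain ratio, the criterion C4.5 uses), whereas a real C4.5 tree would recurse and prune.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Gain ratio for splitting on the categorical attribute at index `attr`."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    gain = entropy(labels) - remainder
    return gain / split_info if split_info > 0 else 0.0


def extract_rules(rows, labels):
    """Choose the best attribute by gain ratio and emit one rule per value.

    Returns (attribute index, {value: (predicted class, rule confidence)}).
    """
    best = max(range(len(rows[0])), key=lambda a: gain_ratio(rows, labels, a))
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[best]].append(y)
    rules = {}
    for value, ys in groups.items():
        label, count = Counter(ys).most_common(1)[0]
        rules[value] = (label, count / len(ys))
    return best, rules


def black_box(window):
    """Hypothetical stand-in for the trained model (H = helix, E = strand, C = coil)."""
    centre = window[1]
    if centre in "AEL":
        return "H"
    if centre in "VI":
        return "E"
    return "C"


# Toy three-residue windows (invented for illustration, not from RS126 or PSIPRED).
windows = ["GAK", "KEV", "PLS", "AVG", "SIT", "LGP", "EPA", "TAV", "GLE", "KVS"]

# The surrogate is trained on the black box's predictions, not on true labels.
predicted = [black_box(w) for w in windows]
position, rules = extract_rules(windows, predicted)

for residue, (label, confidence) in sorted(rules.items()):
    print(f"IF residue at window position {position} is {residue} "
          f"THEN {label} (confidence {confidence:.2f})")
```

Because the surrogate is fitted to the model's outputs, each rule's confidence here measures agreement with the black box on the toy windows, which is a separate question from accuracy against experimentally determined structure.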