Kuang Rui, Ie Eugene, Wang Ke, Wang Kai, Siddiqi Mahira, Freund Yoav, Leslie Christina
Department of Computer Science, Columbia University, New York, NY 10027, USA.
Proc IEEE Comput Syst Bioinform Conf. 2004:152-60. doi: 10.1109/csb.2004.1332428.
We introduce novel profile-based string kernels for use with support vector machines (SVMs) for protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. Thanks to an efficient data structure, the kernels are fast to compute once the profiles have been obtained: the time needed to run PSI-BLAST to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database, in which profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We also show how to use the learned SVM classifier to extract "discriminative sequence motifs", short regions of the original profile that contribute almost all the weight of the SVM classification score, and show that these motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results are comparable to those of cluster kernels while providing much better scalability to large datasets.
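As a rough illustration of the position-dependent mutation neighborhoods described above, the following sketch computes a toy profile kernel in Python. It rests on several assumptions not stated in the abstract: a k-mer belongs to the neighborhood of a profile window when its summed negative log-probability under that window falls below a threshold (here called `sigma`), each profile is a list of per-position dictionaries mapping residues to probabilities, and the kernel is the inner product of the resulting k-mer count vectors. The function names and the two-letter alphabet are hypothetical, and the depth-first enumeration stands in for, but is not, the efficient data structure the abstract mentions.

```python
import math
from collections import Counter


def neighborhood(window, alphabet, sigma):
    """List every k-mer whose summed -log p under `window` is below sigma.

    `window` is a list of k dicts, one per position, mapping residue -> probability.
    """
    found = []

    def extend(prefix, cost, pos):
        if pos == len(window):
            found.append(prefix)
            return
        for residue in alphabet:
            p = window[pos].get(residue, 0.0)
            if p <= 0.0:
                continue  # a zero-probability residue can never be in the neighborhood
            new_cost = cost - math.log(p)
            # Each term -log p is non-negative, so the cost only grows:
            # once it reaches sigma, the whole branch can be pruned.
            if new_cost < sigma:
                extend(prefix + residue, new_cost, pos + 1)

    extend("", 0.0, 0)
    return found


def feature_map(profile, alphabet, k, sigma):
    """Count, over all length-k windows, how often each k-mer is in a neighborhood."""
    counts = Counter()
    for start in range(len(profile) - k + 1):
        for kmer in neighborhood(profile[start:start + k], alphabet, sigma):
            counts[kmer] += 1
    return counts


def profile_kernel(profile_x, profile_y, alphabet, k, sigma):
    """Inner product of the two k-mer count vectors."""
    fx = feature_map(profile_x, alphabet, k, sigma)
    fy = feature_map(profile_y, alphabet, k, sigma)
    return sum(count * fy[kmer] for kmer, count in fx.items())


# Toy data: a two-letter alphabet and a three-position profile (illustrative only).
ALPHABET = "AB"
toy_profile = [{"A": 0.9, "B": 0.1} for _ in range(3)]
print(profile_kernel(toy_profile, toy_profile, ALPHABET, k=2, sigma=3.0))
```

A kernel matrix built this way could in principle be passed to any SVM implementation that accepts precomputed kernels. For real protein profiles with a 20-letter alphabet, this naive enumeration only stays tractable when `sigma` is small enough to prune most branches.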