Profile-based string kernels for remote homology detection and motif extraction.

Authors

Kuang Rui, Ie Eugene, Wang Ke, Wang Kai, Siddiqi Mahira, Freund Yoav, Leslie Christina

Affiliation

Department of Computer Science, Columbia University, New York, NY 10027, USA.

Publication

Proc IEEE Comput Syst Bioinform Conf. 2004:152-60. doi: 10.1109/csb.2004.1332428.

DOI:10.1109/csb.2004.1332428

PMID:16448009

Abstract

We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" -- short regions of the original profile that contribute almost all the weight of the SVM classification score -- and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results are comparable to cluster kernels while providing much better scalability to large datasets.
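The idea of a position-dependent mutation neighborhood can be illustrated with a small sketch. The abstract does not give the exact definition, so the code below assumes one plausible form: a candidate k-mer belongs to the neighborhood of a profile window when the sum of its negative log emission probabilities stays below a threshold `sigma`. The function names, the `sigma` parameter, and the two-letter toy alphabet are all hypothetical, and the brute-force enumeration stands in for the efficient data structure the abstract mentions, which is not reproduced here.

```python
import itertools
import math
from collections import Counter


def kmer_neighborhood(profile_window, alphabet, sigma):
    """Return every k-mer whose summed negative log-probability under the
    window's profile columns is below sigma (assumed neighborhood rule)."""
    k = len(profile_window)
    neighbors = []
    for candidate in itertools.product(alphabet, repeat=k):
        cost = 0.0
        for column, residue in zip(profile_window, candidate):
            p = column.get(residue, 0.0)
            if p <= 0.0:
                cost = math.inf  # residue is impossible at this position
                break
            cost -= math.log(p)
        if cost < sigma:
            neighbors.append("".join(candidate))
    return neighbors


def profile_feature_map(profile, alphabet, k, sigma):
    """Count, over all length-k windows of the profile, how many
    neighborhoods each k-mer falls into."""
    features = Counter()
    for start in range(len(profile) - k + 1):
        window = profile[start:start + k]
        features.update(kmer_neighborhood(window, alphabet, sigma))
    return features


def profile_kernel(profile_x, profile_y, alphabet, k, sigma):
    """Inner product of the two k-mer count vectors."""
    fx = profile_feature_map(profile_x, alphabet, k, sigma)
    fy = profile_feature_map(profile_y, alphabet, k, sigma)
    return sum(count * fy[kmer] for kmer, count in fx.items())


# Toy profiles: one dict of residue probabilities per sequence position.
toy_x = [{"A": 0.9, "B": 0.1}, {"A": 0.5, "B": 0.5}, {"A": 0.1, "B": 0.9}]
toy_y = [{"A": 0.5, "B": 0.5}, {"A": 0.1, "B": 0.9}]
print(profile_kernel(toy_x, toy_y, "AB", k=2, sigma=1.0))
```

Enumerating all candidate k-mers costs time exponential in k, so this sketch is only practical for toy alphabets and small k. A real implementation over the 20 amino acids would need a pruned traversal, for example over a k-mer trie. The resulting kernel matrix could then be passed to any SVM implementation that accepts precomputed kernels.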


Similar articles

Profile-based string kernels for remote homology detection and motif extraction.

Proc IEEE Comput Syst Bioinform Conf. 2004:152-60. doi: 10.1109/csb.2004.1332428.

Profile-based string kernels for remote homology detection and motif extraction.

J Bioinform Comput Biol. 2005 Jun;3(3):527-50. doi: 10.1142/s021972000500120x.

SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition.

BMC Bioinformatics. 2007 May 22;8 Suppl 4(Suppl 4):S2. doi: 10.1186/1471-2105-8-S4-S2.

Mismatch string kernels for discriminative protein classification.

Bioinformatics. 2004 Mar 1;20(4):467-76. doi: 10.1093/bioinformatics/btg431. Epub 2004 Jan 22.

Protein homology detection using string alignment kernels.

Bioinformatics. 2004 Jul 22;20(11):1682-9. doi: 10.1093/bioinformatics/bth141. Epub 2004 Feb 26.

SVM-HUSTLE--an iterative semi-supervised machine learning approach for pairwise protein remote homology detection.

Bioinformatics. 2008 Mar 15;24(6):783-90. doi: 10.1093/bioinformatics/btn028. Epub 2008 Feb 1.

Support vector machines with profile-based kernels for remote protein homology detection.

Genome Inform. 2004;15(2):191-200.

Application of string kernels in protein sequence classification.

Appl Bioinformatics. 2005;4(1):45-52. doi: 10.2165/00822942-200504010-00005.

Application of latent semantic analysis to protein remote homology detection.

Bioinformatics. 2006 Feb 1;22(3):285-90. doi: 10.1093/bioinformatics/bti801. Epub 2005 Nov 29.

Remote protein homology detection and fold recognition using two-layer support vector machine classifiers.

Comput Biol Med. 2011 Aug;41(8):687-99. doi: 10.1016/j.compbiomed.2011.06.004. Epub 2011 Jun 25.

Cited by

Building blocks and blueprints for bacterial autolysins.

PLoS Comput Biol. 2021 Apr 1;17(4):e1008889. doi: 10.1371/journal.pcbi.1008889. eCollection 2021 Apr.

Computational prediction shines light on type III secretion origins.

Sci Rep. 2016 Oct 7;6:34516. doi: 10.1038/srep34516.

LocTree3 prediction of localization.

Nucleic Acids Res. 2014 Jul;42(Web Server issue):W350-5. doi: 10.1093/nar/gku396. Epub 2014 May 21.

LocTree2 predicts localization for all domains of life.

Bioinformatics. 2012 Sep 15;28(18):i458-i465. doi: 10.1093/bioinformatics/bts390.

Exploiting physico-chemical properties in string kernels.

BMC Bioinformatics. 2010 Oct 26;11 Suppl 8(Suppl 8):S7. doi: 10.1186/1471-2105-11-S8-S7.

Machine learning based prediction for peptide drift times in ion mobility spectrometry.

Bioinformatics. 2010 Jul 1;26(13):1601-7. doi: 10.1093/bioinformatics/btq245. Epub 2010 May 21.

Efficient alignment-free DNA barcode analytics.

BMC Bioinformatics. 2009 Nov 10;10 Suppl 14(Suppl 14):S9. doi: 10.1186/1471-2105-10-S14-S9.

Efficient use of unlabeled data for protein sequence classification: a comparative study.

BMC Bioinformatics. 2009 Apr 29;10 Suppl 4(Suppl 4):S2. doi: 10.1186/1471-2105-10-S4-S2.

A new prediction strategy for long local protein structures using an original description.

Proteins. 2009 Aug 15;76(3):570-87. doi: 10.1002/prot.22370.

MiRTif: a support vector machine-based microRNA target interaction filter.

BMC Bioinformatics. 2008 Dec 12;9 Suppl 12(Suppl 12):S4. doi: 10.1186/1471-2105-9-S12-S4.
