Nojoomi Saghi, Koehl Patrice
Biotechnology Program, University of California, Davis, 1 Shields Avenue, Davis, CA 95616, USA.
Department of Computer Science and Genome Center, 1 Shields Avenue, Davis, CA 95616, USA.
BMC Bioinformatics. 2017 Feb 28;18(1):137. doi: 10.1186/s12859-017-1560-9.
The amino acid sequence of a protein is the blueprint from which its structure, and ultimately its function, can be derived. Sequence comparison methods therefore remain essential for determining the similarity between proteins. Traditional approaches for comparing two protein sequences start from the strings of letters (amino acids) that represent the sequences, generate textual alignments between these strings, and assign a score to each alignment. When the similarity between the two sequences is low, however, the quality of the corresponding alignment is usually poor, and the ability to recognize that similarity suffers accordingly.
In this study, we develop an alignment-free alternative to these methods that is based on the concept of string kernels. Starting from recently proposed kernels on the discrete space of protein sequences (Shen et al., Found. Comput. Math., 2013, 14:951-984), we introduce our own version, SeqKernel. Its implementation depends on two parameters: a coefficient that tunes the substitution matrix, and the maximum length of the k-mers that it includes. We provide an exhaustive analysis of the impact of these two parameters on the performance of SeqKernel for fold recognition. We show that, with an appropriate choice of parameters, the SeqKernel similarity measure improves fold recognition compared with traditional alignment-based methods. We illustrate the application of SeqKernel to inferring phylogeny on RNA polymerases and show that it performs as well as methods based on multiple sequence alignments.
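The abstract does not give the formula behind SeqKernel, so the following Python sketch only illustrates the general idea of a k-mer string kernel controlled by the two parameters mentioned above. Everything in it is an assumption made for illustration: the names `toy_similarity`, `residue_kernel`, `kmer_kernel_sum`, and `normalized_kernel` are hypothetical, the toy similarity (1.0 for identical residues, 0.5 otherwise) stands in for a real substitution matrix, `beta` plays the role of the coefficient that tunes the matrix, and `k_max` is the maximum k-mer length. It should not be read as the authors' implementation.

```python
import math


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for a substitution-matrix entry (not BLOSUM)."""
    return 1.0 if a == b else 0.5


def residue_kernel(a: str, b: str, beta: float) -> float:
    """Residue-level similarity raised to the tuning coefficient beta."""
    return toy_similarity(a, b) ** beta


def kmer_kernel_sum(f: str, g: str, beta: float, k_max: int) -> float:
    """Sum, over k = 1..k_max, of the similarity of every k-mer of f with
    every k-mer of g. A k-mer pair scores the product of its residue kernels."""
    total = 0.0
    # prod[i][j] holds the running product for the k-mers starting at f[i], g[j].
    prod = [[1.0] * len(g) for _ in f]
    for k in range(1, k_max + 1):
        n_f = len(f) - k + 1  # number of k-mers in f
        n_g = len(g) - k + 1  # number of k-mers in g
        if n_f <= 0 or n_g <= 0:
            break  # one sequence is shorter than k
        for i in range(n_f):
            for j in range(n_g):
                # Extend each (k-1)-mer pair by one residue to get the k-mer pair.
                prod[i][j] *= residue_kernel(f[i + k - 1], g[j + k - 1], beta)
                total += prod[i][j]
    return total


def normalized_kernel(f: str, g: str, beta: float, k_max: int) -> float:
    """Scale the raw kernel so that a sequence compared with itself scores 1."""
    k_fg = kmer_kernel_sum(f, g, beta, k_max)
    k_ff = kmer_kernel_sum(f, f, beta, k_max)
    k_gg = kmer_kernel_sum(g, g, beta, k_max)
    return k_fg / math.sqrt(k_ff * k_gg)


if __name__ == "__main__":
    # Arbitrary short peptides, used only to exercise the functions.
    print(normalized_kernel("ACDEFGHIK", "ACDQFGHLK", beta=1.0, k_max=3))
```

No alignment is computed at any point: the score depends only on how similar the k-mers of one sequence are to the k-mers of the other, which is what makes the comparison alignment-free. In this sketch, raising `beta` sharpens the contrast between matching and mismatching residues, and raising `k_max` gives more weight to longer shared segments.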
We have presented and characterized a new alignment-free method, based on a mathematical kernel, for scoring the similarity of protein sequences. We discuss possible improvements to this method, as well as the extension of its applications to other modeling methods that rely on sequence comparison.