Faculty of Computer Science, Department of Data Analysis and Artificial Intelligence, Moscow, Russia.
Faculty of Computer Science, Department of Big Data and Information Retrieval, Moscow, Russia.
Bioinformatics. 2018 Oct 1;34(19):3281-3288. doi: 10.1093/bioinformatics/bty349.
Bioinformatics studies often rely on similarity measures between sequence pairs, which often pose a bottleneck in large-scale sequence analysis.
Here, we present a new convolutional kernel function for protein sequences called the Lempel-Ziv-Welch (LZW)-Kernel. It is based on code words identified with the LZW universal text compressor. The LZW-Kernel is an alignment-free method, it is always symmetric, is positive, always provides 1.0 for self-similarity and it can directly be used with Support Vector Machines (SVMs) in classification problems, contrary to normalized compression distance, which often violates the distance metric properties in practice and requires further techniques to be used with SVMs. The LZW-Kernel is a one-pass algorithm, which makes it particularly plausible for big data applications. Our experimental studies on remote protein homology detection and protein classification tasks reveal that the LZW-Kernel closely approaches the performance of the Local Alignment Kernel (LAK) and the SVM-pairwise method combined with Smith-Waterman (SW) scoring at a fraction of the time. Moreover, the LZW-Kernel outperforms the SVM-pairwise method when combined with Basic Local Alignment Search Tool (BLAST) scores, which indicates that the LZW code words might be a better basis for similarity measures than local alignment approximations found with BLAST. In addition, the LZW-Kernel outperforms n-gram based mismatch kernels, hidden Markov model based SAM and Fisher kernel and protein family based PSI-BLAST, among others. Further advantages include the LZW-Kernel's reliance on a simple idea, its ease of implementation, and its high speed, three times faster than BLAST and several magnitudes faster than SW or LAK in our tests.
LZW-Kernel is implemented as a standalone C code and is a free open-source program distributed under GPLv3 license and can be downloaded from https://github.com/kfattila/LZW-Kernel.
Supplementary data are available at Bioinformatics Online.
生物信息学研究通常依赖于序列对之间的相似性度量,这在大规模序列分析中经常是一个瓶颈。
在这里,我们提出了一种新的蛋白质序列卷积核函数,称为 Lempel-Ziv-Welch(LZW)-Kernel。它基于 LZW 通用文本压缩器识别的码字。LZW-Kernel 是一种无比对方法,它始终是对称的、正值的,对于自相似性总是提供 1.0,并且可以直接与支持向量机(SVM)一起用于分类问题,与归一化压缩距离相反,后者在实践中经常违反距离度量属性,需要进一步的技术与 SVM 一起使用。LZW-Kernel 是一种单遍算法,这使得它特别适用于大数据应用。我们在远程蛋白质同源性检测和蛋白质分类任务上的实验研究表明,LZW-Kernel 在时间的一小部分接近局部比对核(LAK)和 SVM 成对方法与 Smith-Waterman(SW)评分相结合的性能。此外,当与基本局部比对搜索工具(BLAST)评分结合使用时,LZW-Kernel 优于 SVM 成对方法,这表明 LZW 码字可能比 BLAST 找到的局部比对近似值更适合作为相似性度量的基础。此外,LZW-Kernel 优于基于 n 元组的错配核、基于隐马尔可夫模型的 SAM 和 Fisher 核以及基于蛋白质家族的 PSI-BLAST 等。进一步的优势包括 LZW-Kernel 依赖于一个简单的想法、易于实现以及高速,比 BLAST 快三倍,比 SW 或 LAK 快几个数量级在我们的测试中。
LZW-Kernel 作为一个独立的 C 代码实现,是一个免费的开源程序,根据 GPLv3 许可证分发,可以从 https://github.com/kfattila/LZW-Kernel 下载。
补充数据可在生物信息学在线获得。