LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification.

Affiliations

Faculty of Computer Science, Department of Data Analysis and Artificial Intelligence, Moscow, Russia.

Faculty of Computer Science, Department of Big Data and Information Retrieval, Moscow, Russia.

Publication information

Bioinformatics. 2018 Oct 1;34(19):3281-3288. doi: 10.1093/bioinformatics/bty349.

DOI:10.1093/bioinformatics/bty349
PMID:29741583
Abstract

MOTIVATION

Bioinformatics studies often rely on similarity measures between sequence pairs, which often pose a bottleneck in large-scale sequence analysis.

RESULTS

Here, we present a new convolutional kernel function for protein sequences called the Lempel-Ziv-Welch (LZW)-Kernel. It is based on code words identified with the LZW universal text compressor. The LZW-Kernel is an alignment-free method that is always symmetric and positive, always yields 1.0 for self-similarity, and can be used directly with Support Vector Machines (SVMs) in classification problems. By contrast, the normalized compression distance often violates the distance metric properties in practice and requires further techniques before it can be used with SVMs. The LZW-Kernel is a one-pass algorithm, which makes it particularly suitable for big data applications. Our experimental studies on remote protein homology detection and protein classification tasks reveal that the LZW-Kernel closely approaches the performance of the Local Alignment Kernel (LAK) and the SVM-pairwise method combined with Smith-Waterman (SW) scoring in a fraction of the time. Moreover, the LZW-Kernel outperforms the SVM-pairwise method when combined with Basic Local Alignment Search Tool (BLAST) scores, which indicates that the LZW code words might be a better basis for similarity measures than the local alignment approximations found with BLAST. In addition, the LZW-Kernel outperforms n-gram-based mismatch kernels, the hidden Markov model-based SAM and Fisher kernel, and the protein family-based PSI-BLAST, among others. Further advantages include the LZW-Kernel's reliance on a simple idea, its ease of implementation, and its high speed: three times faster than BLAST and several orders of magnitude faster than SW or LAK in our tests.

AVAILABILITY AND IMPLEMENTATION

LZW-Kernel is implemented as standalone C code. It is free, open-source software distributed under the GPLv3 license and can be downloaded from https://github.com/kfattila/LZW-Kernel.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics Online.
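
The abstract describes a kernel built from LZW code words that is symmetric and gives 1.0 for self-similarity. The Python sketch below illustrates that idea under stated assumptions; it is not the authors' C implementation and not necessarily their exact kernel definition. The function names (`lzw_code_words`, `lzw_kernel_sketch`, `gram_matrix`) are hypothetical, and the similarity used here (number of shared code words, cosine-normalized) is an assumed stand-in chosen because it is simple, symmetric, and equals 1.0 when a sequence is compared with itself. Unlike the kernel in the abstract, this stand-in can return 0.0 when two sequences share no code words.

```python
from math import sqrt


def lzw_code_words(seq: str) -> set[str]:
    """Collect the phrases emitted by one left-to-right LZW pass over seq."""
    # Seeding with the symbols of seq parses identically to seeding with
    # the full alphabet, since no other symbol is ever looked up.
    dictionary = set(seq)
    words: set[str] = set()
    current = ""
    for symbol in seq:
        candidate = current + symbol
        if candidate in dictionary:
            # Keep extending the current phrase while it is still known.
            current = candidate
        else:
            # Emit the longest known phrase and learn the extended one.
            words.add(current)
            dictionary.add(candidate)
            current = symbol
    if current:
        words.add(current)  # flush the final phrase
    return words


def lzw_kernel_sketch(x: str, y: str) -> float:
    """Assumed similarity: shared code words, cosine-normalized."""
    wx, wy = lzw_code_words(x), lzw_code_words(y)
    if not wx or not wy:
        return 0.0
    return len(wx & wy) / sqrt(len(wx) * len(wy))


def gram_matrix(seqs: list[str]) -> list[list[float]]:
    """Pairwise similarities for a list of sequences."""
    return [[lzw_kernel_sketch(a, b) for b in seqs] for a in seqs]


if __name__ == "__main__":
    # Toy strings for illustration only; these are not real protein data.
    toy = ["ABABAB", "ABABBA", "CCCC"]
    for row in gram_matrix(toy):
        print(["%.3f" % value for value in row])
```

Because each sequence is parsed once and independently of the others, the code-word sets can be cached and reused across all pairs. A matrix like the one returned by `gram_matrix` is the kind of input accepted by SVM implementations that support precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.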

Similar articles

1. LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification.
   Bioinformatics. 2018 Oct 1;34(19):3281-3288. doi: 10.1093/bioinformatics/bty349.
2. SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition.
   BMC Bioinformatics. 2007 May 22;8 Suppl 4(Suppl 4):S2. doi: 10.1186/1471-2105-8-S4-S2.
3. Fast model-based protein homology detection without alignment.
   Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.
4. FastSK: fast sequence analysis with gapped string kernels.
   Bioinformatics. 2020 Dec 30;36(Suppl_2):i857-i865. doi: 10.1093/bioinformatics/btaa817.
5. Profile-based string kernels for remote homology detection and motif extraction.
   J Bioinform Comput Biol. 2005 Jun;3(3):527-50. doi: 10.1142/s021972000500120x.
6. Mismatch string kernels for discriminative protein classification.
   Bioinformatics. 2004 Mar 1;20(4):467-76. doi: 10.1093/bioinformatics/btg431. Epub 2004 Jan 22.
7. Protein homology detection using string alignment kernels.
   Bioinformatics. 2004 Jul 22;20(11):1682-9. doi: 10.1093/bioinformatics/bth141. Epub 2004 Feb 26.
8. Application of latent semantic analysis to protein remote homology detection.
   Bioinformatics. 2006 Feb 1;22(3):285-90. doi: 10.1093/bioinformatics/bti801. Epub 2005 Nov 29.
9. Optimizing amino acid substitution matrices with a local alignment kernel.
   BMC Bioinformatics. 2006 May 5;7:246. doi: 10.1186/1471-2105-7-246.
10. Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment.
    BMC Bioinformatics. 2007 Jul 13;8:252. doi: 10.1186/1471-2105-8-252.

Cited by

1. A Review of Methods for Estimating Algorithmic Complexity: Options, Challenges, and New Directions.
   Entropy (Basel). 2020 May 30;22(6):612. doi: 10.3390/e22060612.
2. Caretta - A multiple protein structure alignment and feature extraction suite.
   Comput Struct Biotechnol J. 2020 Apr 6;18:981-992. doi: 10.1016/j.csbj.2020.03.011. eCollection 2020.
3. Benchmarking of alignment-free sequence comparison methods.
   Genome Biol. 2019 Jul 25;20(1):144. doi: 10.1186/s13059-019-1755-7.