Local homology recognition and distance measures in linear time using compressed amino acid alphabets.

Author information

Edgar Robert C

Publication information

Nucleic Acids Res. 2004 Jan 16;32(1):380-5. doi: 10.1093/nar/gkh180. Print 2004.

DOI:10.1093/nar/gkh180

PMID:14729922

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC373290/

Abstract

Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. Our method was found to achieve coverage comparable to FFT while being more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
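The k-mer ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the exact distance formula or the alphabets tested, so the six-class Dayhoff-style grouping, the normalization by the shorter sequence, and all function names below are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Assumed compressed alphabet: six Dayhoff-style residue classes.
# The abstract does not list the alphabets actually evaluated.
DAYHOFF6 = {
    **dict.fromkeys("AGPST", "a"),
    **dict.fromkeys("C", "b"),
    **dict.fromkeys("DENQ", "c"),
    **dict.fromkeys("FWY", "d"),
    **dict.fromkeys("HKR", "e"),
    **dict.fromkeys("ILMV", "f"),
}


def compress(seq, alphabet=DAYHOFF6):
    """Map each residue to its class letter; unknown residues become 'x'."""
    return "".join(alphabet.get(c, "x") for c in seq.upper())


def kmer_counts(seq, k):
    """Count every contiguous substring of length k in one pass over seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(x, y, k):
    """Fraction of shared k-mers, normalized by the shorter sequence (assumed form)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum((cx & cy).values())  # multiset intersection: min count per k-mer
    denom = min(len(x), len(y)) - k + 1
    return shared / denom if denom > 0 else 0.0


def kmer_distance(x, y, k):
    """A simple distance in [0, 1]: one minus the shared k-mer fraction."""
    return 1.0 - kmer_similarity(x, y, k)


def shared_kmer_diagonals(x, y, k):
    """Count k-mer matches on each diagonal d = i - j (i in x, j in y).

    Diagonals with many matches suggest locally similar regions.
    """
    positions = defaultdict(list)
    for j in range(len(y) - k + 1):
        positions[y[j:j + k]].append(j)
    diagonals = Counter()
    for i in range(len(x) - k + 1):
        for j in positions.get(x[i:i + k], ()):
            diagonals[i - j] += 1
    return diagonals
```

For example, `MKVLAAGIVG` and `MKVLSAGIVG` differ by one A/S substitution: they share 5 of 8 3-mers in the full alphabet, but all 8 after `compress`, because A and S fall in the same class. Counting is a single pass over each sequence for fixed k; the diagonal scan stays close to linear only when k-mers are rarely repeated.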

Similar articles

On the quality of tree-based protein classification.

Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.

Fast model-based protein homology detection without alignment.

Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.

SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees.

Syst Biol. 2012 Jan;61(1):90-106. doi: 10.1093/sysbio/syr095. Epub 2011 Dec 1.

OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy.

BMC Bioinformatics. 2003 Oct 10;4:47. doi: 10.1186/1471-2105-4-47.

Scoredist: a simple and robust protein sequence distance estimator.

BMC Bioinformatics. 2005 Apr 27;6:108. doi: 10.1186/1471-2105-6-108.

Periodic distributions of hydrophobic amino acids allows the definition of fundamental building blocks to align distantly related proteins.

Proteins. 2007 May 15;67(3):695-708. doi: 10.1002/prot.21319.

Combination of threading potentials and sequence profiles improves fold recognition.

J Mol Biol. 2000 Mar 10;296(5):1319-31. doi: 10.1006/jmbi.2000.3541.

transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences.

BMC Bioinformatics. 2005 Jun 22;6:156. doi: 10.1186/1471-2105-6-156.

A configuration space of homologous proteins conserving mutual information and allowing a phylogeny inference based on pair-wise Z-score probabilities.

BMC Bioinformatics. 2005 Mar 10;6:49. doi: 10.1186/1471-2105-6-49.

Cited by

Analysis of Composition, Structure, and Driving Factors of Root-Associated Endophytic Bacterial Communities of the Chinese Medicinal Herb .

Biology (Basel). 2025 Jul 15;14(7):856. doi: 10.3390/biology14070856.

Diamonds in the rif: Alignment-free comparative genomics analysis reveals strain-transcendent Plasmodium falciparum antigens amidst extensive genetic diversity.

Infect Genet Evol. 2025 Apr;129:105725. doi: 10.1016/j.meegid.2025.105725. Epub 2025 Feb 5.

A comparison of various feature extraction and machine learning methods for antimicrobial resistance prediction in .

Front Antibiot. 2023 Mar 24;2:1126468. doi: 10.3389/frabi.2023.1126468. eCollection 2023.

The Historical Evolution and Significance of Multiple Sequence Alignment in Molecular Structure and Function Prediction.

Biomolecules. 2024 Nov 29;14(12):1531. doi: 10.3390/biom14121531.

An alignment-free method for detection of missing regions for phylogenetic analysis.

Heliyon. 2024 Jun 4;10(11):e32227. doi: 10.1016/j.heliyon.2024.e32227. eCollection 2024 Jun 15.

Portable BLAST-like algorithm library and its implementations for command line, Python, and R.

PLoS One. 2023 Nov 30;18(11):e0289693. doi: 10.1371/journal.pone.0289693. eCollection 2023.

Whole genome sequencing-based identification of human tuberculosis caused by animal-lineage .

J Clin Microbiol. 2023 Nov 21;61(11):e0026023. doi: 10.1128/jcm.00260-23. Epub 2023 Oct 25.

On closing the inopportune gap with consistency transformation and iterative refinement.

PLoS One. 2023 Jul 13;18(7):e0287483. doi: 10.1371/journal.pone.0287483. eCollection 2023.

Genomic sketching with multiplicities and locality-sensitive hashing using Dashing 2.

Genome Res. 2023 Jul;33(7):1218-1227. doi: 10.1101/gr.277655.123. Epub 2023 Jul 6.

Protein-to-genome alignment with miniprot.

Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btad014.

References

Reduction of protein sequence complexity by residue grouping.

Protein Eng. 2003 May;16(5):323-30. doi: 10.1093/protein/gzg044.

Alignment-free sequence comparison-a review.

Bioinformatics. 2003 Mar 1;19(4):513-23. doi: 10.1093/bioinformatics/btg005.

COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance.

J Mol Biol. 2003 Feb 7;326(1):317-36. doi: 10.1016/s0022-2836(02)01371-2.

MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.

Nucleic Acids Res. 2002 Jul 15;30(14):3059-66. doi: 10.1093/nar/gkf436.

Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method.

Mol Biol Evol. 2002 Jan;19(1):8-13. doi: 10.1093/oxfordjournals.molbev.a003985.

Probabilistic and statistical properties of words: an overview.

J Comput Biol. 2000 Feb-Apr;7(1-2):1-46. doi: 10.1089/10665270050081360.

Simplified amino acid alphabets for protein fold recognition and implications for folding.

Protein Eng. 2000 Mar;13(3):149-52. doi: 10.1093/protein/13.3.149.

Optimized representations and maximal information in proteins.

Proteins. 2000 Feb 1;38(2):149-64.

Touring protein fold space with Dali/FSSP.

Nucleic Acids Res. 1998 Jan 1;26(1):316-9. doi: 10.1093/nar/26.1.316.

Discovering empirically conserved amino acid substitution groups in databases of protein families.

Proc Int Conf Intell Syst Mol Biol. 1996;4:230-40.
