Local homology recognition and distance measures in linear time using compressed amino acid alphabets.

Author information

Edgar Robert C

Publication information

Nucleic Acids Res. 2004 Jan 16;32(1):380-5. doi: 10.1093/nar/gkh180. Print 2004.

Abstract

Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. The method for achieving comparable coverage to FFT is revealed here, and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
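The abstract does not give formulas, so the following is only a minimal sketch of the general idea, under stated assumptions. The six-class residue grouping in `GROUPS` is hypothetical and chosen for illustration; it is not one of the alphabets evaluated in the paper. The similarity score (shared k-mers divided by the number of k-mers in the shorter sequence) is one common formulation and is assumed here rather than taken from the abstract. All function names are illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical 6-class residue grouping, for illustration only
# (assumption: not an alphabet taken from the paper).
GROUPS = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]
COMPRESS = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def compress(seq):
    """Map each residue to its class label; unknown residues become 'X'."""
    return "".join(COMPRESS.get(aa, "X") for aa in seq.upper())


def kmer_counts(seq, k):
    """Count every contiguous k-mer in a single pass over the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(x, y, k):
    """Shared k-mers divided by the k-mer count of the shorter sequence."""
    shorter = min(len(x), len(y))
    if shorter < k:
        return 0.0
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum(min(n, cy[kmer]) for kmer, n in cx.items())
    return shared / (shorter - k + 1)


def kmer_distance(x, y, k=3, use_compressed=False):
    """Alignment-free distance in [0, 1]; 0 means identical k-mer content."""
    if use_compressed:
        x, y = compress(x), compress(y)
    return 1.0 - kmer_similarity(x, y, k)


def shared_kmer_diagonals(x, y, k):
    """Count shared k-mers per diagonal d = i - j (i indexes x, j indexes y).

    A diagonal with many votes suggests a local similarity at that offset.
    This naive version is not strictly linear when a k-mer repeats often.
    """
    positions = defaultdict(list)
    for i in range(len(x) - k + 1):
        positions[x[i:i + k]].append(i)
    votes = Counter()
    for j in range(len(y) - k + 1):
        for i in positions.get(y[j:j + k], ()):
            votes[i - j] += 1
    return votes
```

With a fixed k, counting k-mers touches each position once, which is where the linear dependence on sequence length comes from. Compressing the alphabet before counting lets residues in the same class match each other, so diverged sequences can still share k-mers.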

Similar articles

2
On the quality of tree-based protein classification.
Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.
3
Fast model-based protein homology detection without alignment.
Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.
4
SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees.
Syst Biol. 2012 Jan;61(1):90-106. doi: 10.1093/sysbio/syr095. Epub 2011 Dec 1.
5
OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy.
BMC Bioinformatics. 2003 Oct 10;4:47. doi: 10.1186/1471-2105-4-47.
6
Scoredist: a simple and robust protein sequence distance estimator.
BMC Bioinformatics. 2005 Apr 27;6:108. doi: 10.1186/1471-2105-6-108.
8
Combination of threading potentials and sequence profiles improves fold recognition.
J Mol Biol. 2000 Mar 10;296(5):1319-31. doi: 10.1006/jmbi.2000.3541.
9
transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences.
BMC Bioinformatics. 2005 Jun 22;6:156. doi: 10.1186/1471-2105-6-156.

Cited by

3
A comparison of various feature extraction and machine learning methods for antimicrobial resistance prediction in .
Front Antibiot. 2023 Mar 24;2:1126468. doi: 10.3389/frabi.2023.1126468. eCollection 2023.
5
An alignment-free method for detection of missing regions for phylogenetic analysis.
Heliyon. 2024 Jun 4;10(11):e32227. doi: 10.1016/j.heliyon.2024.e32227. eCollection 2024 Jun 15.
6
Portable BLAST-like algorithm library and its implementations for command line, Python, and R.
PLoS One. 2023 Nov 30;18(11):e0289693. doi: 10.1371/journal.pone.0289693. eCollection 2023.
7
Whole genome sequencing-based identification of human tuberculosis caused by animal-lineage .
J Clin Microbiol. 2023 Nov 21;61(11):e0026023. doi: 10.1128/jcm.00260-23. Epub 2023 Oct 25.
8
On closing the inopportune gap with consistency transformation and iterative refinement.
PLoS One. 2023 Jul 13;18(7):e0287483. doi: 10.1371/journal.pone.0287483. eCollection 2023.
9
Genomic sketching with multiplicities and locality-sensitive hashing using Dashing 2.
Genome Res. 2023 Jul;33(7):1218-1227. doi: 10.1101/gr.277655.123. Epub 2023 Jul 6.
10
Protein-to-genome alignment with miniprot.
Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btad014.

References

1
Reduction of protein sequence complexity by residue grouping.
Protein Eng. 2003 May;16(5):323-30. doi: 10.1093/protein/gzg044.
2
Alignment-free sequence comparison-a review.
Bioinformatics. 2003 Mar 1;19(4):513-23. doi: 10.1093/bioinformatics/btg005.
3
COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance.
J Mol Biol. 2003 Feb 7;326(1):317-36. doi: 10.1016/s0022-2836(02)01371-2.
4
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.
Nucleic Acids Res. 2002 Jul 15;30(14):3059-66. doi: 10.1093/nar/gkf436.
6
Probabilistic and statistical properties of words: an overview.
J Comput Biol. 2000 Feb-Apr;7(1-2):1-46. doi: 10.1089/10665270050081360.
7
Simplified amino acid alphabets for protein fold recognition and implications for folding.
Protein Eng. 2000 Mar;13(3):149-52. doi: 10.1093/protein/13.3.149.
9
Touring protein fold space with Dali/FSSP.
Nucleic Acids Res. 1998 Jan 1;26(1):316-9. doi: 10.1093/nar/26.1.316.
