Local homology recognition and distance measures in linear time using compressed amino acid alphabets.

Author information

Edgar Robert C

Publication information

Nucleic Acids Res. 2004 Jan 16;32(1):380-5. doi: 10.1093/nar/gkh180. Print 2004.

Abstract

Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. Our method achieves coverage comparable to FFT and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
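
The k-mer counting idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the seven-group alphabet `GROUPS`, the function names, and the particular similarity formula (shared k-mer count divided by the number of k-mers in the shorter sequence) are all assumptions chosen for clarity. With hashed counting and fixed k, each comparison takes time proportional to the sequence length.

```python
from collections import Counter

# Hypothetical compressed alphabet: residues in the same group are treated
# as identical. Illustrative grouping only, not one evaluated in the paper.
GROUPS = ["AGST", "C", "DENQ", "FWY", "HKR", "ILMV", "P"]
COMPRESS = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def compress(seq):
    """Map each residue to its group label; unknown residues become 'X'."""
    return "".join(COMPRESS.get(c, "X") for c in seq.upper())


def kmer_counts(seq, k):
    """Count every contiguous subsequence of length k in one pass."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(x, y, k):
    """Fraction of k-mers shared by x and y (assumed formula, see lead-in)."""
    n = min(len(x), len(y)) - k + 1  # number of k-mers in the shorter sequence
    if n <= 0:
        return 0.0
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # A k-mer occurring a times in x and b times in y contributes min(a, b).
    shared = sum(min(count, cy[kmer]) for kmer, count in cx.items())
    return shared / n


def kmer_distance(x, y, k):
    """Simple distance: 1 minus the shared k-mer fraction."""
    return 1.0 - kmer_similarity(x, y, k)


if __name__ == "__main__":
    a, b = "ILMV", "VMLI"
    print(kmer_distance(a, b, 2))                      # 1.0: no shared 2-mers
    print(kmer_distance(compress(a), compress(b), 2))  # 0.0: same group string
```

In the usage example, the two peptides share no 2-mers in the full 20-letter alphabet, but become identical once compressed, which illustrates how a reduced alphabet can expose similarity between diverged sequences.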
