Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models.

Author information

Girgis Hani Z, James Benjamin T, Luczak Brian B

Affiliations

Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, 700 University Boulevard, Kingsville, TX 78363, USA.

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139, USA.

Publication information

NAR Genom Bioinform. 2021 Feb 1;3(1):lqab001. doi: 10.1093/nargab/lqab001. eCollection 2021 Mar.

DOI:10.1093/nargab/lqab001

PMID:33554117

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7850047/

Abstract

Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms run in quadratic time, which makes them slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is the identity score (the percentage of identical nucleotides in an optimal alignment, including gaps, of two sequences); there is no need to visualize how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being 2 to 80 times faster than BLAST, Mash, MUMmer4 and USEARCH. Identity was the best-performing tool when searching for low-identity matches. When constructing phylogenetic trees from about 6000 transcripts, the tree built from the scores reported by Identity was the closest to the reference tree (in contrast to andi, FSWM and Mash). Identity is capable of producing pairwise identity scores for bacterial genomes that are millions of nucleotides long; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
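To make the definition of the identity score concrete, here is a minimal Python sketch that computes it the slow way, with a textbook Needleman-Wunsch global alignment. It is not the tool's own method, which predicts the score without aligning. The function name `identity_score` and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration; the abstract does not specify a scoring scheme. The nested loops fill an (n+1) by (m+1) table, which is the quadratic cost the abstract refers to.

```python
def identity_score(a: str, b: str, match: int = 1,
                   mismatch: int = -1, gap: int = -1) -> float:
    """Fraction of identical columns in one optimal global alignment.

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        raise ValueError("at least one sequence must be non-empty")

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Quadratic-time fill: every cell of the table is visited once.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting alignment columns and matches.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                if a[i - 1] == b[j - 1]:
                    matches += 1
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return matches / columns


print(identity_score("ACGT", "AGT"))  # 0.75 (3 matches over 4 columns)
```

Gap columns count toward the denominator, matching the "including gaps" wording in the definition. When several alignments tie for the best score, this sketch reports the identity of whichever one the traceback reaches first.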

