Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models.

Authors

Girgis Hani Z, James Benjamin T, Luczak Brian B

Affiliations

Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, 700 University Boulevard, Kingsville, TX 78363, USA.

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139, USA.

Publication information

NAR Genom Bioinform. 2021 Feb 1;3(1):lqab001. doi: 10.1093/nargab/lqab001. eCollection 2021 Mar.

DOI:10.1093/nargab/lqab001
PMID:33554117
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7850047/
Abstract

Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms run in quadratic time, which makes them slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is the identity score (the percentage of identical nucleotides in an optimal alignment, gaps included, of two sequences); there is no need to visualize how every pair of sequences is aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, a tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being 2-80 times faster than BLAST, Mash, MUMmer4 and USEARCH. Identity was the best-performing tool when searching for low-identity matches. When phylogenetic trees were constructed from about 6000 transcripts, the tree built from the scores reported by Identity was the closest to the reference tree (in contrast to andi, FSWM and Mash). Identity can produce pairwise identity scores for bacterial genomes that are millions of nucleotides long; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
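
To make the quantity in the abstract concrete, the following minimal Python sketch computes an identity score the slow way, from a quadratic-time global alignment, and contrasts it with a simple k-mer statistic that needs only one pass over each sequence. It is an illustration under stated assumptions, not the Identity tool's implementation: the scoring values (match 1, mismatch -1, linear gap -1), the function names `global_identity` and `kmer_intersection`, and the choice of k-mer intersection as the alignment-free feature are all hypothetical. The abstract does not specify which alignment-free statistics or model inputs the tool uses.

```python
from collections import Counter


def global_identity(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> float:
    """Identity of one optimal global alignment: matching columns / all columns (gaps included).

    Scoring values are assumptions for illustration. Time and memory are O(len(a) * len(b)).
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back one optimal path, counting alignment columns and identical columns.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return matches / columns if columns else 1.0


def kmer_intersection(a: str, b: str, k: int = 3) -> float:
    """Shared k-mer count divided by the larger k-mer total.

    A stand-in alignment-free feature (an assumption, not the paper's statistic);
    for a fixed k it is computed in time linear in the sequence lengths.
    """
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((ca & cb).values())  # multiset intersection of the two histograms
    total = max(sum(ca.values()), sum(cb.values()))
    return shared / total if total else 0.0


if __name__ == "__main__":
    print(global_identity("ACGT", "AGGT"))          # one mismatch in four columns
    print(kmer_intersection("ACGTACGT", "ACGTACGT"))
```

The first function needs the full dynamic-programming table, which is why alignment-based identity becomes impractical for very long sequences. A predictor that maps features like the second function's output to identity scores avoids that table entirely.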

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4dc/7850047/e5d2f33108b5/lqab001fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4dc/7850047/56086de17d6b/lqab001fig2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4dc/7850047/643397807587/lqab001fig3.jpg
