Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches.

Affiliations

University of Göttingen, Institute of Microbiology and Genetics, Department of Bioinformatics, Goldschmidtstraße 1, 37073 Göttingen, Germany.

Publication information

Nucleic Acids Res. 2014 Jul;42(Web Server issue):W7-11. doi: 10.1093/nar/gku398. Epub 2014 May 14.

DOI: 10.1093/nar/gku398
PMID: 24829447
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4086093/

Abstract

In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of so-called spaced words in the input sequences, i.e. words containing 'don't care' or 'wildcard' symbols at certain pre-defined positions. Various distance measures can then be defined on sequences based on their different spaced-word composition. Our second approach defines the distance between two sequences by estimating for each position in the first sequence the length of the longest substring at this position that also occurs in the second sequence with up to k mismatches. Both approaches take a set of deoxyribonucleic acid (DNA) or protein sequences as input and return a matrix of pairwise distance values that can be used as a starting point for clustering algorithms or distance-based phylogeny reconstruction. The two alignment-free programmes are accessible through a web interface at 'Göttingen Bioinformatics Compute Server (GOBICS)': http://spaced.gobics.de and http://kmacs.gobics.de, and the source code can be downloaded.
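
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names are hypothetical, the pattern notation ('1' for a match position, '0' for a don't-care position) is an assumption, the Euclidean distance is only one of the "various distance measures" the abstract alludes to, and the k-mismatch lengths are computed by brute force rather than by the faster estimation used in kmacs. The step that converts average k-mismatch lengths into a distance is not described in the abstract and is left out.

```python
from collections import Counter
from math import sqrt


def spaced_word_profile(seq, pattern):
    """Relative frequencies of the spaced words of `seq` under `pattern`.

    `pattern` is a string of '1' (match position) and '0' (don't-care
    position); this notation is an assumption made for the sketch.
    """
    care = [j for j, c in enumerate(pattern) if c == "1"]
    n_windows = len(seq) - len(pattern) + 1
    # Keep only the characters at match positions of each window.
    counts = Counter(
        "".join(seq[i + j] for j in care) for i in range(n_windows)
    )
    return {word: c / n_windows for word, c in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two spaced-word frequency profiles."""
    words = set(p) | set(q)
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in words))


def distance_matrix(seqs, pattern):
    """Pairwise spaced-word distances for a list of sequences."""
    profiles = [spaced_word_profile(s, pattern) for s in seqs]
    return [[profile_distance(p, q) for q in profiles] for p in profiles]


def k_mismatch_length(s1, i, s2, k):
    """Length of the longest substring of `s1` starting at `i` that occurs
    somewhere in `s2` with at most `k` mismatches (brute force)."""
    best = 0
    for j in range(len(s2)):
        length = mismatches = 0
        while i + length < len(s1) and j + length < len(s2):
            if s1[i + length] != s2[j + length]:
                mismatches += 1
                if mismatches > k:
                    break  # the (k+1)-th mismatch ends the extension
            length += 1
        best = max(best, length)
    return best


def average_k_mismatch_length(s1, s2, k):
    """Average of k_mismatch_length over all positions of `s1`."""
    return sum(k_mismatch_length(s1, i, s2, k) for i in range(len(s1))) / len(s1)
```

With the pattern "101", the words ACG and ATG yield the same spaced word because the middle position is ignored, which is the sense in which the word matches are inexact. Similarly, allowing one mismatch lets the match between ACGTT and ACCTT run across the single differing position instead of stopping there.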

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d07a/4086093/3215f0cc1ea6/gku398fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d07a/4086093/fa944537eda7/gku398fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d07a/4086093/01353a1e4bf1/gku398fig3.jpg

Similar articles

1
Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches.
Nucleic Acids Res. 2014 Jul;42(Web Server issue):W7-11. doi: 10.1093/nar/gku398. Epub 2014 May 14.
2
Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison.
Bioinformatics. 2014 Jul 15;30(14):2000-8. doi: 10.1093/bioinformatics/btu331. Epub 2014 May 13.
3
Fast alignment-free sequence comparison using spaced-word frequencies.
Bioinformatics. 2014 Jul 15;30(14):1991-9. doi: 10.1093/bioinformatics/btu177. Epub 2014 Apr 3.
4
Fast and accurate phylogeny reconstruction using filtered spaced-word matches.
Bioinformatics. 2017 Apr 1;33(7):971-979. doi: 10.1093/bioinformatics/btw776.
5
Sequence Comparison Without Alignment: The SpaM Approaches.
Methods Mol Biol. 2021;2231:121-134. doi: 10.1007/978-1-0716-1036-7_8.
6
Estimating evolutionary distances between genomic sequences from spaced-word matches.
Algorithms Mol Biol. 2015 Feb 11;10:5. doi: 10.1186/s13015-015-0032-x. eCollection 2015.
7
Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points.
Bioinformatics. 2019 Jan 15;35(2):211-218. doi: 10.1093/bioinformatics/bty592.
8
rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison.
PLoS Comput Biol. 2016 Oct 19;12(10):e1005107. doi: 10.1371/journal.pcbi.1005107. eCollection 2016 Oct.
9
The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances.
PLoS One. 2020 Feb 10;15(2):e0228070. doi: 10.1371/journal.pone.0228070. eCollection 2020.
10
DIALIGN at GOBICS--multiple sequence alignment using various sources of external information.
Nucleic Acids Res. 2013 Jul;41(Web Server issue):W3-7. doi: 10.1093/nar/gkt283. Epub 2013 Apr 24.

Cited by

1
Enhancing Taxonomic Categorization of DNA Sequences with Deep Learning: A Multi-Label Approach.
Bioengineering (Basel). 2023 Nov 8;10(11):1293. doi: 10.3390/bioengineering10111293.
2
Interpreting alignment-free sequence comparison: what makes a score a good score?
NAR Genom Bioinform. 2022 Sep 5;4(3):lqac062. doi: 10.1093/nargab/lqac062. eCollection 2022 Sep.
3
Insertions and deletions as phylogenetic signal in an alignment-free context.
PLoS Comput Biol. 2022 Aug 8;18(8):e1010303. doi: 10.1371/journal.pcbi.1010303. eCollection 2022 Aug.
4
Accurate reconstruction of viral genomes in human cells from short reads using iterative refinement.
BMC Genomics. 2022 Jun 6;23(1):422. doi: 10.1186/s12864-022-08649-8.
5
Specificity Analysis of Genome Based on Statistically Identical K-Words With Same Base Combination.
IEEE Open J Eng Med Biol. 2020 Jul 14;1:214-219. doi: 10.1109/OJEMB.2020.3009055. eCollection 2020.
6
'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees.
NAR Genom Bioinform. 2019 Oct 30;2(1):lqz013. doi: 10.1093/nargab/lqz013. eCollection 2020 Mar.
7
Sequence Comparison Without Alignment: The SpaM Approaches.
Methods Mol Biol. 2021;2231:121-134. doi: 10.1007/978-1-0716-1036-7_8.
8
An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction.
BMC Bioinformatics. 2020 Nov 18;21(Suppl 6):404. doi: 10.1186/s12859-020-03738-5.
9
The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances.
PLoS One. 2020 Feb 10;15(2):e0228070. doi: 10.1371/journal.pone.0228070. eCollection 2020.
10
SWeeP: representing large biological sequences datasets in compact vectors.
Sci Rep. 2020 Jan 9;10(1):91. doi: 10.1038/s41598-019-55627-4.

References

1
Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison.
Bioinformatics. 2014 Jul 15;30(14):2000-8. doi: 10.1093/bioinformatics/btu331. Epub 2014 May 13.
2
Fast alignment-free sequence comparison using spaced-word frequencies.
Bioinformatics. 2014 Jul 15;30(14):1991-9. doi: 10.1093/bioinformatics/btu177. Epub 2014 Apr 3.
3
Shared gene structures and clusters of mutually exclusive spliced exons within the metazoan muscle myosin heavy chain genes.
PLoS One. 2014 Feb 3;9(2):e88111. doi: 10.1371/journal.pone.0088111. eCollection 2014.
4
Alignment-free phylogeny of whole genomes using underlying subwords.
Algorithms Mol Biol. 2012 Dec 6;7(1):34. doi: 10.1186/1748-7188-7-34.
5
Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis.
Algorithms Mol Biol. 2012 May 2;7(1):10. doi: 10.1186/1748-7188-7-10.
6
FastTree 2--approximately maximum-likelihood trees for large alignments.
PLoS One. 2010 Mar 10;5(3):e9490. doi: 10.1371/journal.pone.0009490.
7
Genomic DNA k-mer spectra: models and modalities.
Genome Biol. 2009;10(10):R108. doi: 10.1186/gb-2009-10-10-r108. Epub 2009 Oct 8.
8
Estimating mutation distances from unaligned genomes.
J Comput Biol. 2009 Oct;16(10):1487-500. doi: 10.1089/cmb.2009.0106.
9
Pattern-based phylogenetic distance estimation and tree reconstruction.
Evol Bioinform Online. 2007 Feb 25;2:359-75.
10
Reconstructing the phylogeny of 21 completely sequenced arthropod species based on their motor proteins.
BMC Genomics. 2009 Apr 21;10:173. doi: 10.1186/1471-2164-10-173.