Biological evaluation of d2, an algorithm for high-performance sequence comparison.

Authors

Hide W, Burke J, Davison D B

Affiliation

Department of Biochemical and Biophysical Sciences, University of Houston, TX 77204-5934, USA.

Publication

J Comput Biol. 1994 Fall;1(3):199-215. doi: 10.1089/cmb.1994.1.199.

DOI:10.1089/cmb.1994.1.199
PMID:8790465
Abstract

A number of algorithms exist for searching sequence databases for biologically significant similarities based on the primary sequence similarity of aligned sequences. We have determined the biological sensitivity and selectivity of d2, a high-performance comparison algorithm that rapidly determines the relative dissimilarity of large datasets of genetic sequences. d2 uses sequence-word multiplicity as a simple measure of dissimilarity. It is not constrained by the comparison of direct sequence alignments and so can use word contexts to yield new information on relationships. It is extremely efficient, comparing a query of length 884 bases (INS1ECLAC) with 19,540,603 bases of the bacterial division of GenBank (release 76.0) in 51.77 CPU seconds on a Cray Y/MP-48 supercomputer. It is unique in that subsequences (words) of biological interest can be weighted to improve the sensitivity and selectivity of a search over existing methods. We have determined the ability of d2 to detect biologically significant matches between a query and large datasets of DNA sequences while varying parameters such as word-length and window size. We have also determined the distribution of dissimilarity scores within eukaryotic and prokaryotic divisions of GenBank. We have optimized parameters of the d2 program using Cray hardware and present an analysis of the sensitivity and selectivity of the algorithm. A theoretical analysis of the expectation for scores is presented. This work demonstrates that d2 is a unique, sensitive, and selective method of rapid sequence comparison that can detect novel sequence relationships which remain undetected by alternate methodologies.
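The abstract describes d2 only in words: a dissimilarity built from sequence-word multiplicity, with optional weighting of words and tunable word length and window size. The sketch below is one hedged reading of that description, not the authors' implementation. It assumes the score is the weighted sum of squared differences between word counts, and that a windowed search reports the lowest score over all windows of the target. The function names (`word_counts`, `d2_dissimilarity`, `best_window_score`) and the `weights` mapping are hypothetical, introduced here for illustration.

```python
from collections import Counter


def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_dissimilarity(a, b, k, weights=None):
    """Weighted sum of squared word-count differences (assumed form of d2).

    weights maps a word to a multiplier; unlisted words get weight 1.0.
    """
    counts_a = word_counts(a, k)
    counts_b = word_counts(b, k)
    weights = weights or {}
    # Counter returns 0 for words missing from one of the sequences.
    return sum(
        weights.get(word, 1.0) * (counts_a[word] - counts_b[word]) ** 2
        for word in counts_a.keys() | counts_b.keys()
    )


def best_window_score(query, target, k, window):
    """Lowest score of the query against any window of the target.

    Assumption: a windowed search slides a fixed-size window one base at a time.
    """
    if len(target) <= window:
        return d2_dissimilarity(query, target, k)
    return min(
        d2_dissimilarity(query, target[start:start + window], k)
        for start in range(len(target) - window + 1)
    )
```

Under these assumptions, identical sequences score 0, and raising the weight of a word increases the penalty whenever its count differs between the two sequences. No alignment is computed at any point, which is the property the abstract emphasizes.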

