Hide W, Burke J, Davison D B
Department of Biochemical and Biophysical Sciences, University of Houston, TX 77204-5934, USA.
J Comput Biol. 1994 Fall;1(3):199-215. doi: 10.1089/cmb.1994.1.199.
A number of algorithms exist for searching sequence databases for biologically significant similarities based on the primary sequence similarity of aligned sequences. We have determined the biological sensitivity and selectivity of d2, a high-performance comparison algorithm that rapidly determines the relative dissimilarity of large datasets of genetic sequences. d2 uses sequence-word multiplicity as a simple measure of dissimilarity. It is not constrained to comparing direct sequence alignments and so can use word contexts to yield new information on relationships. It is extremely efficient, comparing a query of length 884 bases (INS1ECLAC) with 19,540,603 bases of the bacterial division of GenBank (release 76.0) in 51.77 CPU seconds on a Cray Y/MP-48 supercomputer. It is unique in that subsequences (words) of biological interest can be weighted to improve the sensitivity and selectivity of a search over existing methods. We have determined the ability of d2 to detect biologically significant matches between a query and large datasets of DNA sequences while varying parameters such as word length and window size. We have also determined the distribution of dissimilarity scores within eukaryotic and prokaryotic divisions of GenBank. We have optimized parameters of the d2 program using Cray hardware and present an analysis of the sensitivity and selectivity of the algorithm. A theoretical analysis of the expectation for scores is presented. This work demonstrates that d2 is a unique, sensitive, and selective method of rapid sequence comparison that can detect novel sequence relationships that remain undetected by alternative methods.
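The abstract does not state the scoring formula, so the following is only a minimal sketch of a word-multiplicity dissimilarity in the spirit of d2. It assumes the score is a weighted sum of squared differences between the counts of each overlapping word in two sequences, and that the query is scored against every window of the target. The function names (`word_counts`, `d2_score`, `min_window_d2`), the default weight of 1.0, and the windowing of the target alone are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(a, b, k, weights=None):
    """Weighted sum of squared differences in word multiplicity.

    Assumed form of the score; words absent from `weights` get weight 1.0.
    """
    ca, cb = word_counts(a, k), word_counts(b, k)
    weights = weights or {}
    # A Counter returns 0 for words it has not seen, so the union of
    # keys covers words present in only one of the two sequences.
    return sum(
        weights.get(w, 1.0) * (ca[w] - cb[w]) ** 2
        for w in ca.keys() | cb.keys()
    )


def min_window_d2(query, target, k, window, weights=None):
    """Smallest score of the query against any window of the target.

    Simplification: only the target is windowed here.
    """
    if len(target) <= window:
        return d2_score(query, target, k, weights)
    return min(
        d2_score(query, target[i:i + window], k, weights)
        for i in range(len(target) - window + 1)
    )
```

Under these assumptions, identical sequences score 0, and raising the weight of a word of interest increases the penalty whenever its multiplicity differs between the two sequences, which is one way the word weighting mentioned in the abstract could be expressed. Because no alignment is computed, the score is insensitive to the order in which shared words occur.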