Yamada Tomoyuki, Morishita Shinichi
Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5, Kashinoha, Kashiwa City, Chiba Pref. 277-8562, Japan.
J Bioinform Comput Biol. 2004 Mar;2(1):21-46. doi: 10.1142/s0219720004000454.
The sequencing of the genomes of a variety of species and the growing databases containing expressed sequence tags (ESTs) and complementary DNAs (cDNAs) facilitate the design of highly specific oligomers for use as genomic markers, PCR primers, or DNA oligo microarrays. The first step in evaluating the specificity of short oligomers of about 20 units in length is to determine the frequencies at which the oligomers occur. However, for oligomers longer than about fifty units this is not efficient, as they usually have a frequency of only 1. A more suitable procedure is to consider the mismatch tolerance of an oligomer, that is, the minimum number of mismatches that allows a given oligomer to match a substring other than the target sequence anywhere in the genome or the EST database. However, calculating the exact value of mismatch tolerance is computationally costly and impractical. Therefore, we studied the problem of checking whether an oligomer meets the constraint that its mismatch tolerance is no less than a given threshold. Here, we present an efficient dynamic programming algorithm solution that utilizes suffix and height arrays. We demonstrated the effectiveness of this algorithm by efficiently computing a dense list of numerous oligo-markers applicable to the human genome. Experimental results show that the algorithm runs faster than well-known Abrahamson's algorithm by orders of magnitude and is able to enumerate 65% approximately 76% of qualified oligomers.
对各种物种的基因组进行测序以及不断增加的包含表达序列标签(EST)和互补DNA(cDNA)的数据库,有助于设计高度特异性的寡聚物,用作基因组标记、PCR引物或DNA寡核苷酸微阵列。评估长度约为20个单位的短寡聚物特异性的第一步是确定寡聚物出现的频率。然而,对于长度超过约50个单位的寡聚物,这并不有效,因为它们的频率通常仅为1。一种更合适的方法是考虑寡聚物的错配容忍度,即允许给定寡聚物在基因组或EST数据库中的任何位置与目标序列以外的子串匹配的最小错配数。然而,计算错配容忍度的精确值在计算上成本高昂且不切实际。因此,我们研究了检查寡聚物是否满足其错配容忍度不低于给定阈值这一约束的问题。在此,我们提出了一种利用后缀数组和高度数组的高效动态规划算法解决方案。我们通过高效计算适用于人类基因组的大量寡聚物标记的密集列表,证明了该算法的有效性。实验结果表明,该算法的运行速度比著名的亚伯拉罕森算法快几个数量级,并且能够枚举约65%至76%的合格寡聚物。