Rahmann Sven
Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Inestrasse 73, D-14195 Berlin, Germany.
Proc IEEE Comput Soc Bioinform Conf. 2003;2:57-64.
The design of large scale DNA microarrays is a challenging problem. So far, probe selection algorithms must trade the ability to cope with large scale problems for a loss of accuracy in the estimation of probe quality. We present an approach based on jumps in matching statistics that combines the best of both worlds. This article consists of two parts. The first part is theoretical. We introduce the notion of jumps in matching statistics between two strings and derive their properties. We estimate the frequency of jumps for random strings in a non-uniform Bernoulli model and present a new heuristic argument to find the center of the length distribution of the longest substring that two random strings have in common. The results are generalized to near-perfect matches with a small number of mismatches. In the second part, we use the concept of jumps to improve the accuracy of the longest common factor approach for probe selection by moving from a string-based to an energy-based specificity measure, while only slightly more than doubling the selection time.
大规模DNA微阵列的设计是一个具有挑战性的问题。到目前为止,探针选择算法必须在处理大规模问题的能力与探针质量估计准确性的损失之间进行权衡。我们提出了一种基于匹配统计量跳跃的方法,它结合了两者的优点。本文由两部分组成。第一部分是理论性的。我们引入了两个字符串之间匹配统计量跳跃的概念,并推导了它们的性质。我们估计了非均匀伯努利模型中随机字符串跳跃的频率,并提出了一种新的启发式方法来找到两个随机字符串最长公共子串长度分布的中心。结果被推广到具有少量错配的近似完美匹配。在第二部分中,我们通过从基于字符串的特异性度量转向基于能量的特异性度量,利用跳跃的概念提高了用于探针选择的最长公因子方法的准确性,同时选择时间仅略微增加了一倍多。