Rahmann Sven
Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
Bioinformatics. 2003 Oct;19 Suppl 2:ii156-61. doi: 10.1093/bioinformatics/btg1073.
During microarray production, several thousands of oligonucleotides (short DNA sequences) are synthesized in parallel, one nucleotide at a time. We are interested in finding the shortest possible nucleotide deposition sequence to synthesize all oligos in order to reduce production time and increase oligo quality. Thus we study the shortest common super-sequence problem of several thousand short strings over a four-letter alphabet.
We present a statistical analysis of the basic ALPHABET-LEFTMOST approximation algorithm, and propose several practical heuristics to reduce the length of the super-sequence. Our results show that it is hard to beat ALPHABET-LEFTMOST in the microarray production setting by more than 2 characters, but these savings can improve overall oligo quality by more than four percent.
Source code in C may be obtained by contacting the author, or from http://oligos.molgen.mpg.de.
在微阵列生产过程中,数千个寡核苷酸(短DNA序列)是并行合成的,每次合成一个核苷酸。我们感兴趣的是找到合成所有寡核苷酸的最短可能核苷酸沉积序列,以减少生产时间并提高寡核苷酸质量。因此,我们研究了在一个由四个字母组成的字母表上数千个短字符串的最短公共超序列问题。
我们对基本的“字母表最左”近似算法进行了统计分析,并提出了几种实用的启发式方法来缩短超序列的长度。我们的结果表明,在微阵列生产环境中,很难使超序列长度比“字母表最左”算法缩短超过2个字符,但这些节省可以使整体寡核苷酸质量提高超过4%。
可以通过联系作者或从http://oligos.molgen.mpg.de获取C语言源代码。