Ye Kai, Jia Zhenyu, Wang Yipeng, Flicek Paul, Apweiler Rolf
Molecular Epidemiology section, Medical Statistics and Bioinformatics, Leiden University Medical Center, The Netherlands.
Department of Pathology & Laboratory Medicine, University of California, Irvine, CA 92697, USA.
J Proteomics Bioinform. 2010 Mar 16;3(3):099-103. doi: 10.4172/jpb.1000127.
Unique substrings in genomes may indicate a high level of specificity, which is crucial and fundamental to many techniques in genetic studies, such as PCR, microarray hybridization, Southern and Northern blotting, RNA interference (RNAi), and genome (re)sequencing. However, being a unique sequence in the genome is not by itself adequate to guarantee high specificity. For example, nucleotide mismatches within a certain tolerance may impair specificity even if a substring of interest occurs only once in the genome. In this study we propose the concept of unique-m substrings of genomes for controlling specificity in genome-wide assays. A substring is defined as unique-m if it has only a single perfect match on one strand of the entire genome while all other approximate matches have more than m mismatches. We developed a pattern growth approach to systematically mine such unique-m substrings from a given genome. Unlike most competing methods, our algorithm does not require a pre-processing step to extract sequence information. The search for unique-m substrings in genomes is performed as a single data mining task, so that the similarities among queries are exploited to achieve a substantial speedup. The runtime of our algorithm is linear in the size of the input genome and in the length of the unique-m substrings. In addition, the unique-m mining algorithm has been parallelized to facilitate genome-wide computation on a cluster or on a single machine with multiple CPUs and shared memory.
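The definition above can be illustrated with a minimal brute-force sketch, writing m for the mismatch tolerance (a symbol assumed here). This is not the pattern growth algorithm described in the abstract and does not share its linear runtime; it only checks the stated property directly for one candidate substring. The function names `reverse_complement`, `hamming`, and `is_unique_m` are hypothetical, and the sketch assumes an uppercase A/C/G/T alphabet and that mismatches are counted as Hamming distance between equal-length windows.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string (A/C/G/T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_unique_m(genome, start, length, m):
    """Brute-force check of the unique-m property (illustrative assumption,
    not the authors' method).

    The substring genome[start:start+length] qualifies if it has exactly one
    perfect match over both strands and every other window of the same
    length differs from it in more than m positions.
    """
    query = genome[start:start + length]
    exact = 0
    for strand in (genome, reverse_complement(genome)):
        for i in range(len(strand) - length + 1):
            d = hamming(query, strand[i:i + length])
            if d == 0:
                exact += 1
                if exact > 1:      # a second perfect match breaks uniqueness
                    return False
            elif d <= m:           # an approximate match within tolerance
                return False
    return exact == 1


# Toy example: "ACC" in the genome "ACCGA" (reverse strand "TCGGT").
print(is_unique_m("ACCGA", 0, 3, 1))   # True: all other windows have >= 2 mismatches
print(is_unique_m("ACCGA", 0, 3, 2))   # False: "CCG" differs in only 2 positions
```

The check scans every window on both strands for each query, so its cost grows with the product of genome size and query length per candidate; it is useful only for verifying small cases, not for genome-wide mining.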