Zhang Peisen, Sheng Huitao, Uehara Ryuhei
Laboratory of Population Genetics, National Cancer Institute, NIH, Bethesda, MD 20892, USA.
BMC Bioinformatics. 2004 Jul 6;5:89. doi: 10.1186/1471-2105-5-89.
In population-based studies, it is generally recognized that single nucleotide polymorphism (SNP) markers are not independent. Rather, they are carried by haplotypes, groups of SNPs that tend to be coinherited. It is thus possible to choose a much smaller number of SNPs to use as indices for identifying haplotypes or haplotype blocks in genetic association studies. We refer to these characteristic SNPs as index SNPs. To reduce cost and effort, the smallest set of index SNPs that can distinguish all SNP and haplotype patterns should be chosen. Unfortunately, finding this minimum set is an NP-complete problem: exact solutions rely on brute-force search, which is not feasible for large data sets.
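To make the selection problem concrete, here is a minimal brute-force sketch in Python. It is an illustration only and not the authors' code: it assumes haplotypes are given as distinct, equal-length strings with one character per SNP, it covers only the haplotype-distinguishing part of the problem, and the names `distinguishes` and `min_index_snps` are hypothetical.

```python
from itertools import combinations


def distinguishes(haplotypes, cols):
    """Return True if the SNP columns in `cols` give every haplotype a unique pattern."""
    seen = set()
    for hap in haplotypes:
        key = tuple(hap[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def min_index_snps(haplotypes):
    """Exhaustively search subsets by increasing size; the first hit is a minimum set.

    Assumes the haplotypes are distinct and of equal length. The number of
    subsets grows exponentially with the number of SNPs, which is why this
    approach does not scale.
    """
    n_snps = len(haplotypes[0])
    for size in range(n_snps + 1):
        for cols in combinations(range(n_snps), size):
            if distinguishes(haplotypes, cols):
                return list(cols)
    return None  # reached only if two haplotypes are identical


# Toy data: 4 haplotypes over 5 biallelic SNPs (made up for illustration).
toy_haplotypes = ["00110", "01100", "10011", "11001"]
exact = min_index_snps(toy_haplotypes)
```

In the toy data no single SNP separates four haplotypes, so the search has to move on to pairs. With m SNPs there are 2^m subsets to consider in the worst case.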
We have developed a double classification tree search algorithm to generate index SNPs that can distinguish all SNP and haplotype patterns. The algorithm runs very rapidly and generates very good, though not necessarily minimum, sets of index SNPs, as is to be expected of a heuristic for an NP-complete problem.
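The abstract does not spell out the double classification tree search, so the sketch below does not attempt to reproduce it. Instead it shows a generic greedy partition-refinement heuristic, which illustrates the same trade-off: fast selection of a set that distinguishes all haplotypes but is not guaranteed to be minimum. The function names and the input format (distinct, equal-length strings) are assumptions made for this example.

```python
def split_groups(groups, col):
    """Refine each group of haplotypes by the allele found at SNP column `col`."""
    refined = []
    for group in groups:
        buckets = {}
        for hap in group:
            buckets.setdefault(hap[col], []).append(hap)
        refined.extend(buckets.values())
    return refined


def greedy_index_snps(haplotypes):
    """Greedy heuristic (not the paper's algorithm): repeatedly pick the SNP
    that splits the current groups into the most subgroups.

    Assumes distinct, equal-length haplotype strings. The result always
    distinguishes all haplotypes but may be larger than the minimum set.
    """
    n_snps = len(haplotypes[0])
    groups = [list(dict.fromkeys(haplotypes))]  # drop exact duplicates, keep order
    chosen = []
    while any(len(g) > 1 for g in groups):
        candidates = [c for c in range(n_snps) if c not in chosen]
        best = max(candidates, key=lambda c: len(split_groups(groups, c)))
        groups = split_groups(groups, best)
        chosen.append(best)
    return chosen


# Toy data: 4 haplotypes over 5 biallelic SNPs (made up for illustration).
toy = ["00110", "01100", "10011", "11001"]
picked = greedy_index_snps(toy)
patterns = {tuple(h[c] for c in picked) for h in toy}
```

Each round scans every remaining SNP once, so the cost is polynomial in the number of SNPs and haplotypes, in contrast to an exhaustive subset search.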
A new algorithm for index SNP selection has been developed. A webserver for index SNP selection is available at http://cognia.cu-genome.org/cgi-bin/genome/snpIndex.cgi/