Liu Yang, Ng Michael
Centre for Mathematical Imaging and Vision, Hong Kong Baptist University, Hong Kong.
BMC Syst Biol. 2010 Sep 13;4 Suppl 2(Suppl 2):S5. doi: 10.1186/1752-0509-4-S2-S5.
Recent developments in high-resolution single nucleotide polymorphism (SNP) arrays allow detailed assessment of genome-wide variation in humans. There is increasing recognition of the importance of SNPs for medicine and developmental biology. However, a SNP data set typically contains a large number of SNPs (e.g., 400 thousand SNPs in a genome-wide Parkinson disease data set) but only a few hundred samples. Conventional classification methods may not be effective when applied to such genome-wide SNP data.
In this paper, we use a shrunken dissimilarity measure to analyze and select relevant SNPs for classification problems. Examples using HapMap data and Parkinson disease (PD) data demonstrate the effectiveness of the proposed method and illustrate its potential as an analysis tool for SNP data sets. Taking the Parkinson disease data as an example, we perform a whole-genome analysis. From the 367440 SNPs on all 22 chromosomes that have a missing rate below 1%, the method selects 357 SNPs. For the unique genes in which those SNPs are located, a gene-gene similarity value is computed using GOSemSim, and gene pairs whose similarity value exceeds a threshold are selected to construct several groups of genes. For the SNPs involved in these groups of genes, the statistical software PLINK is used to compute pairwise SNP-SNP interactions, and SNP pairs with significance P < 0.01 are used to identify SNP networks based on their P values. Because the SNP networks are constructed from Gene Ontology knowledge, each SNP network is associated with a biological process. Our analysis shows that these networks are related, directly or indirectly, to Parkinson disease.
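The grouping and filtering steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how gene groups are formed from the thresholded pairs, so connected components are assumed here. The threshold of 0.7, the function names, and the toy gene and SNP identifiers are all hypothetical. The similarity values stand in for GOSemSim output and the P values for PLINK output; neither tool is called.

```python
from collections import defaultdict


def group_genes(similarity, threshold):
    """Group genes into connected components of the thresholded similarity graph.

    `similarity` maps (gene_a, gene_b) pairs to a similarity value
    (assumed to come from a tool such as GOSemSim).
    """
    # Keep only gene pairs whose similarity exceeds the threshold.
    adjacency = defaultdict(set)
    for (gene_a, gene_b), value in similarity.items():
        if value > threshold:
            adjacency[gene_a].add(gene_b)
            adjacency[gene_b].add(gene_a)

    groups, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            gene = stack.pop()
            if gene in component:
                continue
            component.add(gene)
            stack.extend(adjacency[gene] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


def significant_pairs(interactions, snp_to_gene, group, alpha=0.01):
    """Keep SNP pairs with P < alpha whose genes both belong to `group`."""
    members = set(group)
    return [
        (snp_a, snp_b, p)
        for snp_a, snp_b, p in interactions
        if p < alpha
        and snp_to_gene.get(snp_a) in members
        and snp_to_gene.get(snp_b) in members
    ]


# Toy data (hypothetical values, for illustration only).
similarity = {
    ("GENE_A", "GENE_B"): 0.9,
    ("GENE_B", "GENE_C"): 0.8,
    ("GENE_C", "GENE_D"): 0.2,
    ("GENE_D", "GENE_E"): 0.75,
}
snp_to_gene = {"snp1": "GENE_A", "snp2": "GENE_B", "snp3": "GENE_C", "snp4": "GENE_D"}
interactions = [("snp1", "snp2", 0.004), ("snp2", "snp3", 0.2), ("snp1", "snp4", 0.001)]

groups = group_genes(similarity, threshold=0.7)
network = significant_pairs(interactions, snp_to_gene, groups[0])
```

In this sketch, each gene group yields one candidate SNP network whose edges are the significant SNP pairs. A pair such as `snp1`/`snp4` is dropped despite its small P value because its genes fall in different groups.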
Experimental results show that our approach is suitable for handling genetic variations and provides useful knowledge in a genome-wide SNP study.