Zhou Nina, Wang Lipo
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.
Genomics Proteomics Bioinformatics. 2007 Dec;5(3-4):242-9. doi: 10.1016/S1672-0229(08)60011-X.
Single nucleotide polymorphisms (SNPs) are genetic variations that determine the differences between any two unrelated individuals, and they can be used to distinguish population groups from one another. For instance, the HapMap dataset covers four population groups and about ten million SNPs. To gain more insight into human evolution, ethnic variation, and population assignment, we propose to identify which SNPs are significant in determining the population groups and then to classify the populations using these relevant SNPs as input features. In this study, we developed a modified t-test ranking measure and applied it to the HapMap genotype data. First, we rank all SNPs with this measure and compare the ranking with those produced by other feature importance measures, including F-statistics and the informativeness for assignment. Second, we feed different numbers of the most highly ranked SNPs to a classifier, such as the support vector machine, to find the feature subset that yields the best classification accuracy. Experimental results showed that the proposed method is very effective in finding SNPs that are significant in determining the population groups, with a reduced computational burden and better classification accuracy.
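The rank-then-select procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula of the modified t-test, so the score below is an assumption: a plain multi-class t-statistic (largest standardized distance between a class mean and the overall mean), not the authors' measure. The genotype coding (0, 1, 2 copies of an allele), the toy data, and the names `t_scores` and `top_k` are likewise hypothetical.

```python
from math import sqrt

def t_scores(X, y, eps=1e-8):
    """Return one multi-class t-score per column (SNP) of X.

    ASSUMPTION: this is a generic multi-class t-statistic, used as a
    stand-in for the paper's modified t-test, whose formula is not
    given in the abstract.
    """
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    rows = {c: [X[i] for i in range(n) if y[i] == c] for c in classes}
    scores = []
    for j in range(p):
        overall = sum(r[j] for r in X) / n
        means = {c: sum(r[j] for r in rows[c]) / len(rows[c]) for c in classes}
        # Pooled within-class standard deviation of SNP j.
        ss = sum((r[j] - means[c]) ** 2 for c in classes for r in rows[c])
        s = sqrt(ss / (n - len(classes)))
        # Largest standardized class-mean deviation; eps avoids division by zero.
        scores.append(max(
            abs(means[c] - overall) / (sqrt(1 / len(rows[c]) + 1 / n) * (s + eps))
            for c in classes
        ))
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring SNPs, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

# Toy data (hypothetical): 6 individuals, 2 populations, 3 SNPs coded 0/1/2.
y = ["A", "A", "A", "B", "B", "B"]
X = [
    [0, 1, 2],
    [0, 2, 2],
    [1, 0, 2],
    [2, 1, 2],
    [2, 0, 2],
    [1, 2, 2],
]

scores = t_scores(X, y)
print(top_k(scores, 1))  # [0]: only the first SNP differs between the groups
```

In the workflow the abstract outlines, the columns returned by `top_k` for several values of `k` would each be passed to a classifier such as a support vector machine, and the `k` with the best accuracy would be kept. To avoid optimistic accuracy estimates, the ranking would normally be computed on training data only.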