Heidema A Geert, Feskens Edith J M, Doevendans Pieter A F M, Ruven Henk J T, van Houwelingen Hans C, Mariman Edwin C M, Boer Jolanda M A
Centre for Nutrition and Health, National Institute for Public Health and the Environment, Bilthoven, The Netherlands.
Genet Epidemiol. 2007 Dec;31(8):910-21. doi: 10.1002/gepi.20251.
Nonparametric approaches have been developed that can analyze large numbers of single nucleotide polymorphisms (SNPs) in modest sample sizes. These approaches have different selection features and may not give similar results when applied to the same dataset. We therefore compared three approaches (set association, random forests and multifactor dimensionality reduction [MDR]) for selecting, from a total of 93 candidate SNPs, a subset of SNPs that are important in determining high-density lipoprotein (HDL)-cholesterol levels. The study population was a random sample from a Dutch monitoring project for cardiovascular disease risk factors, dichotomized into cases (low HDL-cholesterol, n = 533) and non-cases (high HDL-cholesterol, n = 545) based on gender-specific median values for HDL-cholesterol. All three approaches prioritized the same three SNPs as important (CETP Taq1B, CETP-629 C/A and LPL Ser447X). Random forests additionally prioritized two SNPs with weaker main effects (APOC3 3175 G/C and CCR2 Val62Ile), whereas MDR selected MTHFR 677 C/T in combination with CETP Taq1B as its best model. The p-values for the selected models were significant for set association (p = 0.0019), random forests (p < 0.01) and MDR (p < 0.02). In conclusion, applying a combination of multi-locus methods is a useful way in genetic association studies to select a well-defined set of important SNPs for further statistical and epidemiological interpretation, and it provides increased confidence and more information than applying a single method.
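The dichotomization step described above can be illustrated with a minimal sketch. It assumes each participant is a record with a sex and an HDL-cholesterol value, and that "low" means strictly below the sex-specific median. The abstract does not state how values equal to the median were handled, so that rule, the function name `dichotomize_by_sex_median`, the field names, and the sample values are all hypothetical.

```python
from statistics import median


def dichotomize_by_sex_median(records):
    """Label each record as 'case' (low HDL) or 'non-case' (high HDL).

    The cutoff is the median HDL value within each sex. Values equal to
    the cutoff are labeled 'non-case'; this tie rule is an assumption,
    not something stated in the abstract.
    """
    # Group HDL values by sex so each group gets its own median.
    by_sex = {}
    for rec in records:
        by_sex.setdefault(rec["sex"], []).append(rec["hdl"])
    cutoffs = {sex: median(values) for sex, values in by_sex.items()}

    # Compare every participant against the cutoff for their own sex.
    labels = [
        "case" if rec["hdl"] < cutoffs[rec["sex"]] else "non-case"
        for rec in records
    ]
    return labels, cutoffs


# Made-up illustrative values (mmol/L), not data from the study.
sample = [
    {"sex": "F", "hdl": 1.2},
    {"sex": "F", "hdl": 1.5},
    {"sex": "F", "hdl": 1.8},
    {"sex": "M", "hdl": 0.9},
    {"sex": "M", "hdl": 1.1},
    {"sex": "M", "hdl": 1.4},
]
labels, cutoffs = dichotomize_by_sex_median(sample)
```

Using a separate cutoff per sex keeps the case and non-case groups roughly balanced within each sex, which a single pooled median would not do if HDL-cholesterol levels differ between men and women.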