Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers.

Authors

Long N, Gianola D, Rosa G J M, Weigel K A, Avendaño S

Affiliation

Department of Animal Sciences, University of Wisconsin, Madison, WI 53706, USA.

Publication information

J Anim Breed Genet. 2007 Dec;124(6):377-89. doi: 10.1111/j.1439-0388.2007.00694.x.

DOI:10.1111/j.1439-0388.2007.00694.x

PMID:18076475

Abstract

Genome-wide association studies using single nucleotide polymorphisms (SNPs) can identify genetic variants related to complex traits. Typically, thousands of SNPs are genotyped, whereas the number of phenotypes for which there is genomic information may be smaller. When predicting phenotypes, options for statistical model building range from incorporating all possible markers into the specification to including only sets of relevant SNPs (features). In the latter case, an efficient method of selecting influential features is required. A two-step feature selection method for binary traits was developed, consisting of filtering (using information gain) and wrapping (using naïve Bayesian classification). The filter reduces the large number of SNPs to a much smaller set, to facilitate the wrapper step. As the procedure is tailored for discrete outcomes, an approach based on discretization of phenotypic values was developed to enable feature selection in a classification framework. The method was applied to chick mortality rates (0-14 days of age) of progeny from 201 sires in a commercial broiler line, with the goal of identifying, among more than 5000 SNPs, those related to progeny mortality. To mimic a case-control study, sires were clustered into two groups, low and high, according to two arbitrarily chosen mortality rate cut points. By varying these thresholds, 11 different 'case-control' samples were formed, and the SNP selection procedure was applied to each sample. To compare the 11 sets of chosen SNPs, the predicted residual sum of squares (PRESS) from a linear model was used. In each case-control sample, the two-step method improved naïve Bayesian classification accuracy from around 50% without feature selection to above 90% with it. The best case-control group (63 sires above or below the thresholds) had the smallest PRESS statistic among groups with model p-values below 0.003. The 17 SNPs selected using this group accounted for 31% of the variation in raw mortality rates between sire families.
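
The two-step procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes genotypes coded 0/1/2, inclusive cut points, Laplace smoothing in the naïve Bayes classifier, and a greedy forward search scored by leave-one-out accuracy as the wrapper; none of these details are stated in the abstract. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def case_control_labels(rates, low_cut, high_cut):
    """Label sires 0 (rate <= low_cut) or 1 (rate >= high_cut); drop the rest."""
    kept, labels = [], []
    for i, r in enumerate(rates):
        if r <= low_cut or r >= high_cut:
            kept.append(i)
            labels.append(0 if r <= low_cut else 1)
    return kept, labels

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus their expected entropy given the feature."""
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def filter_step(X, y, keep):
    """Filter: rank SNP columns by information gain, return the top `keep`."""
    gains = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)[:keep]

def nb_predict(train_X, train_y, x, snps, n_levels=3):
    """Naive Bayes prediction with Laplace smoothing (assumed, not from the paper)."""
    best, best_lp = None, -math.inf
    for c in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == c]
        lp = math.log(len(rows) / len(train_y))  # log prior
        for j in snps:
            count = sum(1 for r in rows if r[j] == x[j])
            lp += math.log((count + 1) / (len(rows) + n_levels))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

def loo_accuracy(X, y, snps):
    """Leave-one-out classification accuracy using only the given SNPs."""
    hits = sum(
        nb_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i], snps) == y[i]
        for i in range(len(y))
    )
    return hits / len(y)

def wrapper_step(X, y, candidates):
    """Wrapper: greedily add the SNP that most improves leave-one-out accuracy."""
    selected, best_acc = [], 0.0
    while True:
        best_j = None
        for j in candidates:
            if j not in selected:
                acc = loo_accuracy(X, y, selected + [j])
                if acc > best_acc:
                    best_acc, best_j = acc, j
        if best_j is None:
            return selected, best_acc
        selected.append(best_j)

# Made-up toy data: 9 sires, 4 SNPs; only SNP 0 tracks the mortality group.
rates = [0.01, 0.02, 0.01, 0.03, 0.09, 0.10, 0.08, 0.12, 0.05]
genotypes = [
    [0, 1, 2, 0], [0, 0, 1, 1], [0, 2, 0, 2], [0, 1, 1, 0],
    [2, 1, 2, 1], [2, 0, 1, 2], [2, 2, 0, 0], [2, 1, 1, 1],
    [1, 1, 1, 1],
]
kept, labels = case_control_labels(rates, low_cut=0.03, high_cut=0.08)
X = [genotypes[i] for i in kept]
top = filter_step(X, labels, keep=2)
selected, accuracy = wrapper_step(X, labels, top)
```

Running the filter first keeps the wrapper affordable: the wrapper refits the classifier for every candidate SNP at every step, which would be costly across several thousand markers.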

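
The PRESS statistic used in the abstract to compare the selected SNP sets can be sketched as follows. This is an illustration under assumptions, not the authors' code: it fits an ordinary least-squares model with an explicit intercept column, refits it with each observation left out in turn, and sums the squared prediction errors for the held-out observations. The helper names and the three-point toy data set are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def press(X, y):
    """Sum of squared leave-one-out prediction errors of the linear model."""
    total = 0.0
    for i in range(len(y)):
        beta = ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        pred = sum(b * x for b, x in zip(beta, X[i]))
        total += (y[i] - pred) ** 2
    return total

# Made-up toy data: first column is the intercept, second a single predictor.
X_toy = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y_toy = [0.0, 1.0, 3.0]
press_toy = press(X_toy, y_toy)
```

For ordinary least squares, the same quantity can be obtained from a single fit as the sum of squared values of e_i / (1 - h_ii), where e_i is the ordinary residual and h_ii the leverage of observation i; the explicit refitting above is used only because it makes the definition visible.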