Fondazione Bruno Kessler, Trento, Italy.
BMC Bioinformatics. 2010 Oct 26;11 Suppl 8(Suppl 8):S3. doi: 10.1186/1471-2105-11-S8-S3.
Quantitative phenotypes emerge everywhere in systems biology and biomedicine, either because of a direct interest in quantitative traits or because high individual variability makes it hard or impossible to classify samples into distinct categories, as is often the case with complex common diseases. Machine learning approaches to genotype-phenotype mapping may significantly improve Genome-Wide Association Studies (GWAS) results by explicitly focusing on predictivity and optimal feature selection in a multivariate setting. It is, however, essential that stringent and well-documented Data Analysis Protocols (DAP) are used to control sources of variability and ensure reproducibility of results. We present a genome-to-phenotype pipeline of machine learning modules for quantitative phenotype prediction. The pipeline can be applied for the direct use of whole-genome information in functional studies. As a realistic example, we consider the problem of fitting complex phenotypic traits in heterogeneous stock mice from single nucleotide polymorphisms (SNPs).
The core element in the pipeline is the L1L2 regularization method based on the naïve elastic net. The method gives at the same time a regression model and a dimensionality reduction procedure suitable for correlated features. Model and SNP markers are selected through a DAP originally developed in the MAQC-II collaborative initiative of the U.S. FDA for the identification of clinical biomarkers from microarray data. The L1L2 approach is compared with standard Support Vector Regression (SVR) and with Recursive Jump Monte Carlo Markov Chain (MCMC). Algebraic indicators of stability of partial lists are used for model selection; the final panel of markers is obtained by a procedure at the chromosome scale, termed 'saturation', to recover SNPs in Linkage Disequilibrium with those selected.
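To illustrate why a naïve elastic net suits correlated features, the following minimal sketch fits the combined L1 and L2 penalty by coordinate descent. It is not the authors' implementation: the function names, the toy genotype matrix, the penalty values and the objective scaling (squared error divided by 2n, plus l1 times the L1 norm, plus l2/2 times the squared L2 norm) are all assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def naive_elastic_net(X, y, l1, l2, n_iter=200):
    """Coordinate descent for
    (1/(2n)) * ||y - X b||^2 + l1 * ||b||_1 + (l2/2) * ||b||_2^2.
    X is a list of rows; no intercept is fitted, so center the data first."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant-zero column carries no signal
            # Correlation of feature j with the partial residual
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_beta = soft_threshold(rho, l1) / (col_sq[j] + l2)
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Hypothetical toy data: centered genotype codes for three SNPs.
# SNP 0 and SNP 1 are identical (perfectly correlated); SNP 2 is unrelated.
X = [[-1, -1, 1], [0, 0, -2], [1, 1, 1],
     [-1, -1, 1], [0, 0, -2], [1, 1, 1]]
y = [-2.0, 0.0, 2.0, -2.0, 0.0, 2.0]  # phenotype driven by SNP 0 only

beta = naive_elastic_net(X, y, l1=0.1, l2=0.5)
print(beta)
```

In this toy setting the two identical SNPs receive equal coefficients while the unrelated SNP is set to zero. With the L2 term removed (l2 = 0, a pure lasso), this same coordinate descent would assign the whole effect to whichever of the two identical SNPs is updated first, which is the behaviour the combined penalty is meant to avoid.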
With respect to both MCMC and SVR, comparable accuracies are obtained by the L1L2 pipeline. Good agreement is also found between SNPs selected by the L1L2 algorithms and candidate loci previously identified by a standard GWAS. The combination of L1L2-based feature selection with a saturation procedure tackles the issue of neglecting highly correlated features that affects many feature selection algorithms.
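The abstract does not state how 'saturation' is computed, so the sketch below is only a simple stand-in for the idea of recovering SNPs in Linkage Disequilibrium with those already selected: it adds every SNP on the same chromosome whose squared correlation with a selected SNP reaches a threshold. The function names, the r² measure, the 0.8 threshold and the toy genotypes are all hypothetical.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (v - mean_b) for x, v in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP has no defined correlation
    return cov * cov / (var_a * var_b)


def saturate(genotypes, chromosome, selected, threshold=0.8):
    """Extend the selected panel with same-chromosome SNPs whose r^2 with
    any selected SNP is at least threshold (assumed stand-in criterion)."""
    panel = set(selected)
    for s in selected:
        for name, values in genotypes.items():
            if name in panel or chromosome[name] != chromosome[s]:
                continue
            if r_squared(genotypes[s], values) >= threshold:
                panel.add(name)
    return sorted(panel)


# Hypothetical genotypes coded as minor-allele counts (0, 1, 2)
genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 1, 2, 0, 2, 2],  # strongly correlated with snp_a
    "snp_c": [2, 0, 1, 1, 2, 0],  # weakly correlated with snp_a
    "snp_d": [0, 1, 2, 0, 1, 2],  # identical to snp_a, other chromosome
}
chromosome = {"snp_a": 1, "snp_b": 1, "snp_c": 1, "snp_d": 2}

panel = saturate(genotypes, chromosome, selected=["snp_a"])
print(panel)
```

Restricting the search to the chromosome of each selected SNP mirrors the "chromosome scale" wording of the abstract; whether the original procedure uses r², another LD measure, or an iterative refit is not specified here.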
The L1L2 pipeline has proven effective in terms of marker selection and prediction accuracy. This study indicates that machine learning techniques may support quantitative phenotype prediction, provided that adequate DAPs are employed to control bias in model selection.