Piette Elizabeth R, Moore Jason H
1 Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
2 Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
BioData Min. 2018 Apr 19;11:6. doi: 10.1186/s13040-018-0167-7. eCollection 2018.
Machine learning methods and conventions are increasingly employed for the analysis of large, complex biomedical data sets, including genome-wide association studies (GWAS). Reproducibility of machine learning analyses of GWAS can be hampered by biological and statistical factors, particularly so for the investigation of non-additive genetic interactions. Application of traditional cross validation to a GWAS data set may result in poor consistency between the training and testing data set splits due to an imbalance of the interaction genotypes relative to the data as a whole. We propose a new cross validation method, proportional instance cross validation (PICV), that preserves the original distribution of an independent variable when splitting the data set into training and testing partitions.
We apply PICV to simulated GWAS data with epistatic interactions of varying minor allele frequencies and prevalences and compare performance to that of a traditional cross validation procedure in which individuals are randomly allocated to training and testing partitions. Sensitivity and positive predictive value are significantly improved across all tested scenarios for PICV compared to traditional cross validation. We also apply PICV to GWAS data from a study of primary open-angle glaucoma to investigate a previously reported interaction. The interaction fails to replicate at a significant level; however, PICV improves the consistency between training and testing results.
Application of traditional machine learning procedures to biomedical data may require modifications to better suit intrinsic characteristics of the data, such as the potential for highly imbalanced genotype distributions in the case of epistasis detection. The reproducibility of genetic interaction findings can be improved by considering this variable imbalance in cross validation implementation, such as with PICV. This approach may be extended to problems in other domains in which imbalanced variable distributions are a concern.
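The abstract does not give the details of the PICV algorithm, so the following is only a minimal sketch of the general idea it describes: splitting individuals into folds so that each fold keeps roughly the same proportion of every value of a chosen independent variable. Here that variable is assumed to be the genotype combination at two interacting SNPs. The function names `proportional_folds` and `train_test_splits`, the 0/1/2 genotype coding, and the toy data are all hypothetical illustrations and are not taken from the authors' implementation.

```python
import random
from collections import Counter, defaultdict


def proportional_folds(strata, k, seed=0):
    """Assign sample indices to k folds so that each fold receives a
    near-equal share of every stratum label (hypothetical helper)."""
    rng = random.Random(seed)

    # Group sample indices by the value of the variable to be balanced.
    by_label = defaultdict(list)
    for idx, label in enumerate(strata):
        by_label[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # carried across strata so that fold sizes stay even
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)  # random order within a stratum
        for i, idx in enumerate(members):
            folds[(offset + i) % k].append(idx)  # deal round-robin
        offset += len(members)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Toy data (assumed): genotype pairs at two SNPs for 100 individuals,
# with the rare (2, 2) combination carried by only 5 of them.
genotype_pairs = [(0, 0)] * 60 + [(1, 1)] * 35 + [(2, 2)] * 5
folds = proportional_folds(genotype_pairs, k=5)
fold_counts = [Counter(genotype_pairs[i] for i in fold) for fold in folds]
for counts in fold_counts:
    print(dict(counts))
```

With a purely random allocation, the five carriers of the rare combination could all land in one or two folds, leaving some test partitions with none. In this sketch every fold receives one of them, which is the kind of training/testing consistency the abstract attributes to preserving the variable's distribution.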