Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.
BMC Bioinformatics. 2010 Jul 22;11:394. doi: 10.1186/1471-2105-11-394.
It is hypothesized that common, complex diseases may be due to complex interactions between genetic and environmental factors, which are difficult to detect in high-dimensional data using traditional statistical approaches. Multifactor Dimensionality Reduction (MDR) is the most commonly used data-mining method to detect epistatic interactions. In all data-mining methods, internal validation procedures are important: they provide prediction estimates that help prevent model over-fitting and reduce potential false-positive findings. Currently, MDR uses cross-validation for internal validation. In this study, we incorporate a three-way split (3WS) of the data, combined with a post-hoc pruning procedure, as an alternative to cross-validation for internal model validation, with the aim of reducing computation time without impairing performance. We compare the power to detect true disease-causing loci using MDR with both 5- and 10-fold cross-validation to that of MDR with 3WS for a range of single-locus and epistatic disease models. Additionally, we analyze a dataset in HIV immunogenetics to demonstrate the results of the two strategies on real data.
MDR with 3WS runs approximately five times faster than MDR with 5-fold cross-validation. Before pruning, the power to find exactly the true disease loci, without detecting any false-positive loci, is higher with 5-fold cross-validation than with 3WS. However, the power to find the true disease-causing loci when additional false-positive loci are allowed is equivalent for the two approaches. When a pruning procedure is applied after the 3WS, the power of the 3WS approach to detect only the exact disease loci is equivalent to that of MDR with cross-validation. In the real data application, the cross-validation and 3WS analyses indicate the same two-locus model.
Our results reveal that the two internal validation methods perform equivalently when pruning procedures are used. The specific pruning procedure should be chosen with the trade-off in mind: identifying all relevant genetic effects at the cost of including false positives, versus excluding false positives at the cost of missing important genetic factors. This implies that 3WS may be a powerful and computationally efficient approach to screen for epistatic effects, and could be used to identify candidate interactions in large-scale genetic studies.
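The three-way split described above can be illustrated with a minimal sketch. It only shows how samples might be partitioned once into three disjoint sets, in contrast to the repeated partitioning of k-fold cross-validation. The function name `three_way_split`, the 50/25/25 proportions, and the roles given to the three sets in the comments are assumptions made for illustration; they are not taken from the abstract and do not reproduce the MDR software.

```python
import random


def three_way_split(n_samples, fractions=(0.5, 0.25, 0.25), seed=0):
    """Partition sample indices into three disjoint sets.

    Hypothetical helper: the fractions are illustrative assumptions,
    not the proportions used in the study.
    """
    if len(fractions) != 3 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be three values summing to 1")

    # Shuffle once with a fixed seed so the split is reproducible.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_train = int(fractions[0] * n_samples)
    n_test = int(fractions[1] * n_samples)

    # Assumed roles: candidate models are searched on the training set,
    # narrowed on the testing set, and the final model's prediction
    # estimate comes from the untouched validation set.
    training = indices[:n_train]
    testing = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]
    return training, testing, validation


training, testing, validation = three_way_split(100)
print(len(training), len(testing), len(validation))
```

Because the data are partitioned only once, the exhaustive search over locus combinations is run a single time on a subset of the samples, rather than once per fold. This is one plausible reading of where the reported reduction in computation time comes from.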