
A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction.

Author information

Velez Digna R, White Bill C, Motsinger Alison A, Bush William S, Ritchie Marylyn D, Williams Scott M, Moore Jason H

Affiliation

Center for Human Genetics Research, Vanderbilt University Medical Center, Nashville, Tennessee.

Publication information

Genet Epidemiol. 2007 May;31(4):306-15. doi: 10.1002/gepi.20211.

Abstract

Multifactor dimensionality reduction (MDR) was developed as a method for detecting statistical patterns of epistasis. The overall goal of MDR is to change the representation space of the data to make interactions easier to detect. It is well known that machine learning methods may not provide robust models when the class variable (e.g. case-control status) is imbalanced and accuracy is used as the fitness measure. This is because most methods learn patterns that are relevant for the larger of the two classes. The goal of this study was to evaluate three different strategies for improving the power of MDR to detect epistasis in imbalanced datasets. The methods evaluated were: (1) over-sampling that resamples with replacement the smaller class until the data are balanced, (2) under-sampling that randomly removes subjects from the larger class until the data are balanced, and (3) balanced accuracy [(sensitivity+specificity)/2] as the fitness function with and without an adjusted threshold. These three methods were compared using simulated data with two-locus epistatic interactions of varying heritability (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and minor allele frequency (0.2, 0.4) that were embedded in 100 replicate datasets of varying sample sizes (400, 800, 1600). Each dataset was generated with different ratios of cases to controls (1 : 1, 1 : 2, 1 : 4). We found that the balanced accuracy function with an adjusted threshold significantly outperformed both over-sampling and under-sampling and fully recovered the power. These results suggest that balanced accuracy should be used instead of accuracy for the MDR analysis of epistasis in imbalanced datasets.
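The three strategies compared in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: all function names are hypothetical, the over-sampling variant shown keeps the original minority subjects and adds resampled copies, and the abstract does not define the adjusted threshold, so the sketch assumes it is the overall case:control ratio of the dataset applied when labeling a genotype cell as high-risk.

```python
import random


def accuracy(tp, fn, tn, fp):
    """Fraction of all subjects classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """(sensitivity + specificity) / 2, as given in the abstract."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def oversample(minority, majority, rng):
    """Resample the smaller class with replacement until both classes match in size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority


def undersample(minority, majority, rng):
    """Randomly remove subjects from the larger class until both classes match in size."""
    return minority, rng.sample(majority, len(minority))


def high_risk(cases, controls, threshold=1.0):
    """Label a genotype cell high-risk when its case:control ratio exceeds the threshold.

    Assumption: a threshold of 1.0 is the unadjusted default, and the adjusted
    threshold is the case:control ratio of the whole dataset.
    """
    if controls == 0:
        return cases > 0
    return cases / controls > threshold


# A 1:4 dataset (100 cases, 400 controls) and a trivial model that
# calls every subject a control.
tp, fn, tn, fp = 0, 100, 400, 0
acc = accuracy(tp, fn, tn, fp)           # rewards the majority-only model
bal = balanced_accuracy(tp, fn, tn, fp)  # no better than chance

# A cell holding 30 cases and 60 controls is enriched for cases relative
# to the 1:4 background, but only the adjusted threshold flags it.
unadjusted = high_risk(30, 60)
adjusted = high_risk(30, 60, threshold=100 / 400)

print(acc, bal, unadjusted, adjusted)
```

In this toy example the majority-only model scores 0.8 on accuracy but 0.5 on balanced accuracy, which shows why accuracy is a misleading fitness measure when cases and controls are imbalanced.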
