A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction.

Authors

Velez Digna R, White Bill C, Motsinger Alison A, Bush William S, Ritchie Marylyn D, Williams Scott M, Moore Jason H

Affiliation

Center for Human Genetics Research, Vanderbilt University Medical Center, Nashville, Tennessee.

Publication information

Genet Epidemiol. 2007 May;31(4):306-15. doi: 10.1002/gepi.20211.

DOI:10.1002/gepi.20211

PMID:17323372

Abstract

Multifactor dimensionality reduction (MDR) was developed as a method for detecting statistical patterns of epistasis. The overall goal of MDR is to change the representation space of the data to make interactions easier to detect. It is well known that machine learning methods may not provide robust models when the class variable (e.g. case-control status) is imbalanced and accuracy is used as the fitness measure. This is because most methods learn patterns that are relevant for the larger of the two classes. The goal of this study was to evaluate three different strategies for improving the power of MDR to detect epistasis in imbalanced datasets. The methods evaluated were: (1) over-sampling that resamples with replacement the smaller class until the data are balanced, (2) under-sampling that randomly removes subjects from the larger class until the data are balanced, and (3) balanced accuracy [(sensitivity+specificity)/2] as the fitness function with and without an adjusted threshold. These three methods were compared using simulated data with two-locus epistatic interactions of varying heritability (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and minor allele frequency (0.2, 0.4) that were embedded in 100 replicate datasets of varying sample sizes (400, 800, 1600). Each dataset was generated with different ratios of cases to controls (1 : 1, 1 : 2, 1 : 4). We found that the balanced accuracy function with an adjusted threshold significantly outperformed both over-sampling and under-sampling and fully recovered the power. These results suggest that balanced accuracy should be used instead of accuracy for the MDR analysis of epistasis in imbalanced datasets.
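The difference between accuracy and balanced accuracy described above can be illustrated with a minimal Python sketch. The balanced accuracy formula, (sensitivity + specificity)/2, comes from the abstract. Everything else is an assumption made for illustration: the toy genotype cells "A" and "B", their counts, the helper names, and the reading of the "adjusted threshold" as the overall case:control ratio of the dataset in place of the usual MDR threshold of 1. This is not the authors' implementation.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of subjects classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """(sensitivity + specificity) / 2; assumes both classes are present."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def high_risk_cells(cells, y, threshold):
    """Label a genotype cell high risk when its cases/controls ratio exceeds
    the threshold (hypothetical helper, written without division so that
    cells with zero controls are handled)."""
    cases = Counter(c for c, s in zip(cells, y) if s == 1)
    controls = Counter(c for c, s in zip(cells, y) if s == 0)
    return {c for c in set(cells) if cases[c] > threshold * controls[c]}


def predict(cells, high_risk):
    """Predict case (1) for subjects in high-risk cells, control (0) otherwise."""
    return [1 if c in high_risk else 0 for c in cells]


# Toy 1:4 dataset (invented counts): cell A holds 15 cases and 20 controls,
# cell B holds 5 cases and 60 controls.
cells = ["A"] * 15 + ["A"] * 20 + ["B"] * 5 + ["B"] * 60
y = [1] * 15 + [0] * 20 + [1] * 5 + [0] * 60

# Unadjusted threshold of 1: no cell has more cases than controls,
# so every subject is predicted to be a control.
naive_pred = predict(cells, high_risk_cells(cells, y, threshold=1.0))

# Assumed adjusted threshold: the overall case:control ratio (20/80 = 0.25).
ratio = sum(y) / (len(y) - sum(y))
adjusted_pred = predict(cells, high_risk_cells(cells, y, threshold=ratio))

print("naive:   ", accuracy(y, naive_pred), balanced_accuracy(y, naive_pred))
print("adjusted:", accuracy(y, adjusted_pred), balanced_accuracy(y, adjusted_pred))
```

In this toy example, plain accuracy prefers the trivial model that calls everyone a control (0.80 against 0.75), while balanced accuracy prefers the model that identifies the enriched cell (0.75 against 0.50). That is the behavior the abstract attributes to accuracy-based fitness in imbalanced data.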

Similar articles

A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction.

Genet Epidemiol. 2007 May;31(4):306-15. doi: 10.1002/gepi.20211.

Exploring the performance of Multifactor Dimensionality Reduction in large scale SNP studies and in the presence of genetic heterogeneity among epistatic disease models.

Hum Hered. 2009;67(3):183-92. doi: 10.1159/000181157. Epub 2008 Dec 15.

Class Balanced Multifactor Dimensionality Reduction to Detect Gene-Gene Interactions.

IEEE/ACM Trans Comput Biol Bioinform. 2020 Jan-Feb;17(1):71-81. doi: 10.1109/TCBB.2018.2858776. Epub 2018 Jul 23.

A novel survival multifactor dimensionality reduction method for detecting gene-gene interactions with application to bladder cancer prognosis.

Hum Genet. 2011 Jan;129(1):101-10. doi: 10.1007/s00439-010-0905-5. Epub 2010 Oct 28.

A Belief Degree-Associated Fuzzy Multifactor Dimensionality Reduction Framework for Epistasis Detection.

Methods Mol Biol. 2021;2212:307-323. doi: 10.1007/978-1-0716-0947-7_19.

An empirical fuzzy multifactor dimensionality reduction method for detecting gene-gene interactions.

BMC Genomics. 2017 Mar 14;18(Suppl 2):115. doi: 10.1186/s12864-017-3496-x.

Model-Based Multifactor Dimensionality Reduction to detect epistasis for quantitative traits in the presence of error-free and noisy data.

Eur J Hum Genet. 2011 Jun;19(6):696-703. doi: 10.1038/ejhg.2011.17. Epub 2011 Mar 16.

MDR-ER: balancing functions for adjusting the ratio in risk classes and classification errors for imbalanced cases and controls using multifactor-dimensionality reduction.

PLoS One. 2013 Nov 13;8(11):e79387. doi: 10.1371/journal.pone.0079387. eCollection 2013.

Evaluating the detection ability of a range of epistasis detection methods on simulated data for pure and impure epistatic models.

PLoS One. 2022 Feb 18;17(2):e0263390. doi: 10.1371/journal.pone.0263390. eCollection 2022.

A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility.

J Theor Biol. 2006 Jul 21;241(2):252-61. doi: 10.1016/j.jtbi.2005.11.036. Epub 2006 Feb 2.

Cited by

Decoding sexual dimorphism of the sex-shared nervous system at single-neuron resolution.

bioRxiv. 2025 May 8:2024.12.27.630541. doi: 10.1101/2024.12.27.630541.

Decoding sexual dimorphism of the sex-shared nervous system at single-neuron resolution.

Sci Adv. 2025 Jul 11;11(28):eadv9106. doi: 10.1126/sciadv.adv9106.

Ecological features facilitating spread of alien plants along Mediterranean mountain roads.

Biol Invasions. 2024;26(11):3879-3899. doi: 10.1007/s10530-024-03418-y. Epub 2024 Aug 8.

High-performing cross-dataset machine learning reveals robust microbiota alteration in secondary apical periodontitis.

Front Cell Infect Microbiol. 2024 Jun 21;14:1393108. doi: 10.3389/fcimb.2024.1393108. eCollection 2024.

LORIS robustly predicts patient outcomes with immune checkpoint blockade therapy using common clinical, pathologic and genomic features.

Nat Cancer. 2024 Aug;5(8):1158-1175. doi: 10.1038/s43018-024-00772-7. Epub 2024 Jun 3.

SEEI: spherical evolution with feedback mechanism for identifying epistatic interactions.

BMC Genomics. 2024 May 13;25(1):462. doi: 10.1186/s12864-024-10373-4.

Amphiregulin, ST2, and REG3α biomarker risk algorithms as predictors of nonrelapse mortality in patients with acute GVHD.

Blood Adv. 2024 Jun 25;8(12):3284-3292. doi: 10.1182/bloodadvances.2023011049.

Development and validation of a random forest algorithm for source attribution of animal and human Typhimurium and monophasic variants of Typhimurium isolates in England and Wales utilising whole genome sequencing data.

Front Microbiol. 2024 Mar 12;14:1254860. doi: 10.3389/fmicb.2023.1254860. eCollection 2023.

Harnessing Speech-Derived Digital Biomarkers to Detect and Quantify Cognitive Decline Severity in Older Adults.

Gerontology. 2024;70(4):429-438. doi: 10.1159/000536250. Epub 2024 Jan 12.

Artificial Intelligence in Scoliosis Classification: An Investigation of Language-Based Models.

J Pers Med. 2023 Dec 9;13(12):1695. doi: 10.3390/jpm13121695.
