Improving machine learning reproducibility in genetic association studies with proportional instance cross validation (PICV).

Author information

Piette Elizabeth R, Moore Jason H

Affiliations

1 Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.

2 Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.

Publication information

BioData Min. 2018 Apr 19;11:6. doi: 10.1186/s13040-018-0167-7. eCollection 2018.

DOI:10.1186/s13040-018-0167-7

PMID:29713384

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC5907739/

Abstract

BACKGROUND

Machine learning methods and conventions are increasingly employed for the analysis of large, complex biomedical data sets, including genome-wide association studies (GWAS). Reproducibility of machine learning analyses of GWAS can be hampered by biological and statistical factors, particularly so for the investigation of non-additive genetic interactions. Application of traditional cross validation to a GWAS data set may result in poor consistency between the training and testing data set splits due to an imbalance of the interaction genotypes relative to the data as a whole. We propose a new cross validation method, proportional instance cross validation (PICV), that preserves the original distribution of an independent variable when splitting the data set into training and testing partitions.
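The abstract does not spell out the PICV algorithm, so the following is only a minimal sketch of the idea it describes: splitting samples so that each partition keeps roughly the same distribution of an independent variable, here the joint genotype of a hypothetical SNP pair. The function name `proportional_instance_folds`, the 0/1/2 genotype coding, the simulated allele frequencies, and the choice of five folds are all assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def proportional_instance_folds(strata, k, seed=0):
    """Assign sample indices to k folds so that every stratum (for example a
    two-SNP genotype combination) is spread as evenly as possible over folds.

    This is an illustrative stand-in for PICV, not the authors' implementation.
    """
    rng = random.Random(seed)

    # Group sample indices by the value of the variable to be preserved.
    by_stratum = defaultdict(list)
    for idx, key in enumerate(strata):
        by_stratum[key].append(idx)

    folds = [[] for _ in range(k)]
    position = 0  # round-robin position carried across strata to balance fold sizes
    for key in sorted(by_stratum):
        members = by_stratum[key]
        rng.shuffle(members)  # random order within a stratum
        for idx in members:
            folds[position % k].append(idx)
            position += 1
    return folds


# Hypothetical toy data: two SNPs coded as minor-allele counts (0, 1, or 2).
data_rng = random.Random(42)


def draw_genotype(maf):
    """Draw a genotype as the sum of two independent allele draws."""
    return int(data_rng.random() < maf) + int(data_rng.random() < maf)


n_samples = 1000
snp_a = [draw_genotype(0.2) for _ in range(n_samples)]
snp_b = [draw_genotype(0.2) for _ in range(n_samples)]
strata = list(zip(snp_a, snp_b))  # joint genotype of the SNP pair

folds = proportional_instance_folds(strata, k=5)

# Each fold serves once as the testing partition; the rest form the training set.
for i, test_idx in enumerate(folds):
    train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    test_counts = Counter(strata[idx] for idx in test_idx)
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)} "
          f"rare (2, 2) genotypes in test={test_counts[(2, 2)]}")
```

With purely random allocation, a rare genotype combination such as (2, 2) could by chance be concentrated in one or two folds; continuing the round-robin assignment within each stratum keeps its per-fold counts within one of each other. Whether the published method also balances on case/control status is not stated in the abstract.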

RESULTS

We apply PICV to simulated GWAS data with epistatic interactions of varying minor allele frequencies and prevalences and compare performance to that of a traditional cross validation procedure in which individuals are randomly allocated to training and testing partitions. Sensitivity and positive predictive value are significantly improved across all tested scenarios for PICV compared to traditional cross validation. We also apply PICV to GWAS data from a study of primary open-angle glaucoma to investigate a previously reported interaction. The interaction does not replicate at a significant level; PICV nevertheless improves the consistency between training and testing results.
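For readers less familiar with the two metrics named above, the sketch below computes them from their standard definitions: sensitivity is TP / (TP + FN) and positive predictive value is TP / (TP + FP). The counts are hypothetical, and how the paper tallies true and false positives for interaction detection is not described in the abstract.

```python
import math


def sensitivity(true_pos, false_neg):
    """Fraction of real positives that were detected: TP / (TP + FN)."""
    denom = true_pos + false_neg
    return true_pos / denom if denom else math.nan


def positive_predictive_value(true_pos, false_pos):
    """Fraction of reported positives that are real: TP / (TP + FP)."""
    denom = true_pos + false_pos
    return true_pos / denom if denom else math.nan


# Hypothetical counts, for illustration only.
tp, fp, fn = 40, 10, 20
print("sensitivity:", sensitivity(tp, fn))
print("PPV:", positive_predictive_value(tp, fp))
```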

CONCLUSIONS

Application of traditional machine learning procedures to biomedical data may require modifications to better suit intrinsic characteristics of the data, such as the potential for highly imbalanced genotype distributions in the case of epistasis detection. The reproducibility of genetic interaction findings can be improved by considering this variable imbalance in cross validation implementation, such as with PICV. This approach may be extended to problems in other domains in which imbalanced variable distributions are a concern.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/34fa/5907739/080c0f387172/13040_2018_167_Fig1_HTML.jpg
