Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study.

Authors

Molotkov Ivan, Artomov Mykyta

Affiliations

The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, United States.

Department of Pediatrics, The Ohio State University, Columbus, OH, United States.

Publication information

Bioinform Adv. 2023 Sep 14;3(1):vbad128. doi: 10.1093/bioadv/vbad128. eCollection 2023.

DOI:10.1093/bioadv/vbad128

PMID:37745001

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10517638/

Abstract

MOTIVATION

Positive-unlabeled data consists of points with either positive or unknown labels. It is widespread in medical, genetic, and biological settings, creating a high demand for predictive positive-unlabeled models. The performance of such models is usually estimated using validation sets, assumed to be selected completely at random (SCAR) from known positive examples. For certain metrics, this assumption enables unbiased performance estimation when treating positive-unlabeled data as positive/negative. However, the SCAR assumption is often adopted without proper justification, simply for the sake of convenience.

RESULTS

We provide an algorithm that, under the weak assumption of a lower bound on the number of positive examples, can test for violation of the SCAR assumption. Applying it to the problem of gene prioritization for complex genetic traits, we illustrate that the SCAR assumption is often violated there, causing inflation of performance estimates, which we refer to as validation bias. We estimate the potential impact of validation bias on performance estimation. Our analysis reveals that validation bias is widespread in gene prioritization data and can lead to significant overestimation of model performance. This finding elucidates the discrepancy between the reported good performance of models and their limited practical applications.
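
To make the notion of validation bias concrete, the following toy simulation compares recall estimated on a SCAR validation set with recall estimated on a validation set whose labeled positives are preferentially those the model already scores highly. This is not the authors' detection algorithm (which is available in their repository); it is only a minimal sketch. The score distribution (Beta(2, 2)), the squared-score selection weights, the threshold, and the function name `simulate_recall_estimates` are all illustrative assumptions.

```python
import random


def simulate_recall_estimates(n_pos=2000, n_val=400, threshold=0.5, seed=0):
    """Return (true_recall, scar_recall, biased_recall) for a toy model.

    All quantities here are hypothetical: scores are simulated, not produced
    by any real gene prioritization model.
    """
    rng = random.Random(seed)

    # Simulated model scores for every true positive (labeled or not).
    scores = [rng.betavariate(2, 2) for _ in range(n_pos)]

    # Recall over all true positives; unobservable in real PU data.
    true_recall = sum(s >= threshold for s in scores) / n_pos

    # SCAR validation set: labeled positives drawn uniformly at random.
    scar_set = rng.sample(scores, n_val)
    scar_recall = sum(s >= threshold for s in scar_set) / n_val

    # Biased validation set (assumption): a positive is more likely to be
    # labeled when the model scores it highly, for example because both the
    # labels and the model features favor well-studied genes.
    # Sampling is with replacement, which is adequate for a sketch.
    weights = [s ** 2 for s in scores]
    biased_set = rng.choices(scores, weights=weights, k=n_val)
    biased_recall = sum(s >= threshold for s in biased_set) / n_val

    return true_recall, scar_recall, biased_recall


if __name__ == "__main__":
    true_r, scar_r, biased_r = simulate_recall_estimates()
    print(f"true recall:            {true_r:.3f}")
    print(f"SCAR validation recall: {scar_r:.3f}")
    print(f"biased validation recall: {biased_r:.3f}")
```

Under these assumptions, the SCAR estimate should stay close to the true recall, while the biased estimate should be noticeably higher, which is the kind of inflation the abstract calls validation bias.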

AVAILABILITY AND IMPLEMENTATION

Python code with examples of application of the validation bias detection algorithm is available at github.com/ArtomovLab/ValidationBias.
