

Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study.

Author information

Molotkov Ivan, Artomov Mykyta

Affiliations

The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, United States.

Department of Pediatrics, The Ohio State University, Columbus, OH, United States.

Publication information

Bioinform Adv. 2023 Sep 14;3(1):vbad128. doi: 10.1093/bioadv/vbad128. eCollection 2023.

Abstract

MOTIVATION

Positive-unlabeled data consists of points with either positive or unknown labels. It is widespread in medical, genetic, and biological settings, creating a high demand for predictive positive-unlabeled models. The performance of such models is usually estimated using validation sets that are assumed to be selected completely at random (SCAR) from known positive examples. For certain metrics, this assumption enables unbiased performance estimation when treating positive-unlabeled data as positive/negative. However, the SCAR assumption is often adopted without proper justification, simply for the sake of convenience.
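To make the role of SCAR concrete, the following is a minimal simulation sketch, not taken from the authors' repository. It uses recall as one example of a metric that can be estimated from labeled positives alone. All names, the Gaussian score distribution, and the logistic labeling rule are illustrative assumptions.

```python
import math
import random


def simulate(n_pos=20000, threshold=1.0, seed=0):
    """Compare recall estimated from labeled positives with the true recall.

    Hypothetical setup: every true positive gets a model score drawn from
    N(1, 1); the model calls an item positive when its score > threshold.
    """
    rng = random.Random(seed)
    # Scores of ALL true positives (in practice most of them are unlabeled).
    scores = [rng.gauss(1.0, 1.0) for _ in range(n_pos)]

    def recall(sample):
        return sum(s > threshold for s in sample) / len(sample)

    true_recall = recall(scores)

    # SCAR: every positive is equally likely to end up in the labeled set.
    scar_labeled = rng.sample(scores, n_pos // 4)

    # Non-SCAR (assumed mechanism): positives with high model scores are
    # more likely to be labeled, e.g. because they are better studied.
    biased_labeled = [
        s for s in scores
        if rng.random() < 1.0 / (1.0 + math.exp(-2.0 * (s - 1.0)))
    ]

    return true_recall, recall(scar_labeled), recall(biased_labeled)


if __name__ == "__main__":
    true_r, scar_r, biased_r = simulate()
    print(f"true recall:            {true_r:.3f}")
    print(f"estimate under SCAR:    {scar_r:.3f}")
    print(f"estimate when violated: {biased_r:.3f}")
```

In this toy setup the SCAR estimate tracks the true recall, while the score-dependent labeling inflates it, which is the kind of inflation the abstract calls validation bias.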

RESULTS

We provide an algorithm that can test for violation of the SCAR assumption under a single weak assumption: a lower bound on the number of positive examples. Applying it to the problem of gene prioritization for complex genetic traits, we show that the SCAR assumption is often violated there, inflating performance estimates, an effect we refer to as validation bias. We estimate the potential impact of validation bias on performance estimation. Our analysis reveals that validation bias is widespread in gene prioritization data and can lead to substantial overestimation of model performance. This finding elucidates the discrepancy between the reported good performance of models and their limited practical application.
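The abstract does not spell out the algorithm, so the block below is not the authors' method. It is an illustrative sketch of why a lower bound on the number of positives can make SCAR testable. The reasoning: under SCAR the validation positives are a uniform random subset of all true positives, and a top-k list can contain at most k of them. With at least `pos_lower_bound` positives in total, the number of validation positives landing in the top k is therefore stochastically bounded by a hypergeometric variable. The function names and the top-k framing are assumptions made for this example.

```python
from math import comb


def hypergeom_upper_tail(x_min, population, successes, draws):
    """P(X >= x_min) for X ~ Hypergeometric(population, successes, draws)."""
    total = comb(population, draws)
    hi = min(successes, draws)
    tail = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(max(x_min, 0), hi + 1)
    )
    return tail / total


def scar_violation_pvalue(val_hits, n_val, top_k, pos_lower_bound):
    """Conservative p-value for 'the validation set is SCAR'.

    val_hits        -- validation positives ranked in the model's top_k
    n_val           -- size of the validation set
    top_k           -- length of the top-ranked list
    pos_lower_bound -- assumed lower bound on the number of true positives

    Worst case for the null: exactly pos_lower_bound positives exist and
    as many of them as possible (min(top_k, pos_lower_bound)) are in the
    top_k. Any other admissible configuration gives a smaller tail.
    """
    if n_val > pos_lower_bound:
        raise ValueError("validation set cannot exceed the number of positives")
    successes = min(top_k, pos_lower_bound)
    return hypergeom_upper_tail(val_hits, pos_lower_bound, successes, n_val)


if __name__ == "__main__":
    # Hypothetical numbers: at least 500 true positives, top-100 list,
    # 50 validation positives of which 30 are in the top 100.
    print(scar_violation_pvalue(30, 50, 100, 500))
```

With these hypothetical numbers, SCAR would allow at most 100/500 = 20% of the validation positives in the top 100 on average, so seeing 30 of 50 there yields a very small p-value. A small p-value means either SCAR is violated or the assumed lower bound is wrong.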

AVAILABILITY AND IMPLEMENTATION

Python code with example applications of the validation bias detection algorithm is available at github.com/ArtomovLab/ValidationBias.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/098f/10517638/12002bd34921/vbad128f1.jpg
