Molotkov Ivan, Artomov Mykyta
The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, United States.
Department of Pediatrics, The Ohio State University, Columbus, OH, United States.
Bioinform Adv. 2023 Sep 14;3(1):vbad128. doi: 10.1093/bioadv/vbad128. eCollection 2023.
Positive-unlabeled data consists of points with either positive or unknown labels. Such data is widespread in medical, genetic, and biological settings, creating high demand for predictive positive-unlabeled models. The performance of such models is usually estimated on validation sets that are assumed to be selected completely at random (SCAR) from known positive examples. For certain metrics, this assumption enables unbiased performance estimation when positive-unlabeled data is treated as positive/negative. However, the SCAR assumption is often adopted without proper justification, simply for convenience.
We provide an algorithm that can test for violation of the SCAR assumption under the weak assumption of a lower bound on the number of positive examples. Applying it to gene prioritization for complex genetic traits, we show that the SCAR assumption is often violated there, which inflates performance estimates; we refer to this inflation as validation bias. We estimate the potential impact of validation bias on performance estimation. Our analysis reveals that validation bias is widespread in gene prioritization data and can lead to significant overestimation of model performance. This finding elucidates the discrepancy between the reported good performance of models and their limited practical application.
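The effect described above can be illustrated with a toy simulation. The sketch below is not the authors' detection algorithm; it only shows how recall estimated on labeled positives behaves when labeling is SCAR versus when labeling favors high-scoring positives. Everything in it is an assumption made for illustration: the function name `simulate_recall`, the Gaussian model scores, the fixed decision threshold, the `bias` parameter, and the choice of recall as the metric (the abstract does not name the metrics it refers to).

```python
import random


def simulate_recall(bias, n_pos=20000, label_rate=0.3, threshold=1.0, seed=0):
    """Toy PU setting: return (true_recall, estimated_recall).

    bias = 0 means labeling is SCAR (every positive is labeled with the same
    probability). bias > 0 means positives the model scores above the
    threshold are more likely to be labeled (hypothetical non-SCAR mechanism).
    """
    rng = random.Random(seed)

    # Hypothetical model scores for all true positives.
    scores = [rng.gauss(1.0, 1.0) for _ in range(n_pos)]

    # Recall over all positives, which is unobservable in real PU data.
    true_recall = sum(s > threshold for s in scores) / n_pos

    # Decide which positives end up labeled, i.e. usable for validation.
    labeled = []
    for s in scores:
        if s > threshold:
            p_label = label_rate * (1 + bias)
        else:
            p_label = label_rate * (1 - bias)
        if rng.random() < p_label:
            labeled.append(s)

    # Recall as it would be estimated from the labeled positives only.
    estimated_recall = sum(s > threshold for s in labeled) / len(labeled)
    return true_recall, estimated_recall


true_scar, est_scar = simulate_recall(bias=0.0)
true_biased, est_biased = simulate_recall(bias=0.5)

print(f"SCAR:     true recall {true_scar:.3f}, estimated {est_scar:.3f}")
print(f"non-SCAR: true recall {true_biased:.3f}, estimated {est_biased:.3f}")
```

With `bias=0.0` the estimate should track the true recall up to sampling noise, while with `bias=0.5` the estimate should be noticeably higher than the true value, which is the kind of inflation the abstract calls validation bias.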
Python code with example applications of the validation bias detection algorithm is available at github.com/ArtomovLab/ValidationBias.