Villanova School of Business, Villanova, PA, United States of America.
Robert H. Smith School of Business, University of Maryland, College Park, MD, United States of America.
PLoS One. 2024 Oct 16;19(10):e0296904. doi: 10.1371/journal.pone.0296904. eCollection 2024.
How much information does a dataset contain about an outcome of interest? To answer this question, the method estimates, for a given dataset, the minimum absolute prediction error for an outcome variable that any model could achieve. The estimate is produced by a constrained omniscient model that requires only that identical observations receive identical predictions and that very similar observations receive similar predictions. The resulting bounds on predictive performance work well on both simulated and real-world datasets and are typically within 10% of the performance of the true model. Three applications of the methodology are discussed: measuring data quality, model evaluation, and quantifying the amount of irreducible error in a prediction problem.
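The first constraint (identical observations receive identical predictions) can be illustrated with a minimal sketch. This is not the authors' implementation, and it ignores the second constraint on similar observations. It rests on a standard fact: within a group of identical feature rows, the single prediction that minimizes total absolute error is the median of the group's outcomes. The function name `mae_lower_bound` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def mae_lower_bound(X, y):
    """Smallest in-sample mean absolute error achievable by any model that
    must give identical predictions to identical feature rows.

    Assumption: only exact duplicates are grouped; the similarity
    constraint described in the abstract is not modeled here.
    """
    # Collect the outcomes observed for each distinct feature row.
    groups = defaultdict(list)
    for row, target in zip(X, y):
        groups[tuple(row)].append(target)

    # Within a group, the median minimizes the sum of absolute deviations.
    total_abs_error = 0.0
    for targets in groups.values():
        m = median(targets)
        total_abs_error += sum(abs(t - m) for t in targets)

    return total_abs_error / len(y)


# Hypothetical toy data: three distinct feature rows, two of them repeated.
X = [(1, 0), (1, 0), (1, 0), (2, 5), (2, 5), (3, 1)]
y = [10.0, 12.0, 20.0, 7.0, 9.0, 4.0]
bound = mae_lower_bound(X, y)
print(bound)
```

When every row is unique, this sketch returns zero, which is why some notion of similarity between non-identical observations is needed for a useful bound on real data.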