Tijana Zrnic, Emmanuel J. Candès
Department of Statistics, Stanford University, Stanford, CA 94305.
Stanford Data Science, Stanford University, Stanford, CA 94305.
Proc Natl Acad Sci U S A. 2024 Apr 9;121(15):e2322083121. doi: 10.1073/pnas.2322083121. Epub 2024 Apr 3.
While reliable data-driven decision-making hinges on high-quality labeled data, the acquisition of quality labels often involves laborious human annotations or slow and expensive scientific measurements. Machine learning is becoming an appealing alternative as sophisticated predictive techniques are being used to quickly and cheaply produce large amounts of predicted labels; e.g., predicted protein structures are used to supplement experimentally derived structures, predictions of socioeconomic indicators from satellite imagery are used to supplement accurate survey data, and so on. Since predictions are imperfect and potentially biased, this practice brings into question the validity of downstream inferences. We introduce cross-prediction: a method for valid inference powered by machine learning. With a small labeled dataset and a large unlabeled dataset, cross-prediction imputes the missing labels via machine learning and applies a form of debiasing to remedy the prediction inaccuracies. The resulting inferences achieve the desired error probability and are more powerful than those that only leverage the labeled data. Closely related is the recent proposal of prediction-powered inference [A. N. Angelopoulos, S. Bates, C. Fannjiang, M. I. Jordan, T. Zrnic, 669-674 (2023)], which assumes that a good pretrained model is already available. We show that cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model. Finally, we observe that cross-prediction gives more stable conclusions than its competitors; its confidence intervals typically have significantly lower variability.
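The impute-then-debias recipe described in the abstract can be sketched for the simplest target, the mean of the label, as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name `cross_prediction_mean` and the choice of an ordinary-least-squares predictor are ours, and the paper's confidence-interval construction is omitted, leaving only the cross-fitted point estimate.

```python
import numpy as np

def cross_prediction_mean(X, Y, X_unlabeled, K=5, seed=None):
    """Cross-prediction point estimate of the mean of Y (sketch).

    Cross-fitting: split the n labeled points into K folds. For each fold,
    train a predictor on the other K-1 folds, then use that model to
    (a) impute labels on the large unlabeled dataset and
    (b) compute a debiasing correction from residuals on the held-out fold.
    Here the predictor is plain least squares with an intercept.
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    folds = np.array_split(rng.permutation(n), K)
    imputed = []              # model predictions on the unlabeled data
    residuals = np.empty(n)   # Y_i - f^{(-k(i))}(X_i) on held-out points
    for fold in folds:
        train = np.ones(n, dtype=bool)
        train[fold] = False
        # fit least squares on the K-1 training folds
        A = np.c_[np.ones(train.sum()), X[train]]
        coef, *_ = np.linalg.lstsq(A, Y[train], rcond=None)
        imputed.append(np.c_[np.ones(len(X_unlabeled)), X_unlabeled] @ coef)
        residuals[fold] = Y[fold] - np.c_[np.ones(len(fold)), X[fold]] @ coef
    # estimate = mean of imputed labels + average held-out residual
    # (the residual term corrects the bias of the imperfect predictor)
    return np.mean(imputed) + residuals.mean()

# usage: small labeled sample, large unlabeled sample from a linear model
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.5, size=200)
X_unlabeled = rng.normal(size=(5000, 3))
theta = cross_prediction_mean(X, Y, X_unlabeled, seed=1)
```

Because the correction term averages held-out residuals rather than in-sample ones, the same data point is never used both to train a model and to debias it, which is what makes the resulting inferences valid even when the predictor is biased.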