Cross-prediction-powered inference.

Authors

Zrnic Tijana, Candès Emmanuel J

Affiliations

Department of Statistics, Stanford University, Stanford, CA 94305.

Stanford Data Science, Stanford University, Stanford, CA 94305.

Publication information

Proc Natl Acad Sci U S A. 2024 Apr 9;121(15):e2322083121. doi: 10.1073/pnas.2322083121. Epub 2024 Apr 3.

DOI: 10.1073/pnas.2322083121
PMID: 38568975
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11009639/

Abstract

While reliable data-driven decision-making hinges on high-quality labeled data, the acquisition of quality labels often involves laborious human annotations or slow and expensive scientific measurements. Machine learning is becoming an appealing alternative as sophisticated predictive techniques are being used to quickly and cheaply produce large amounts of predicted labels; e.g., predicted protein structures are used to supplement experimentally derived structures, predictions of socioeconomic indicators from satellite imagery are used to supplement accurate survey data, and so on. Since predictions are imperfect and potentially biased, this practice brings into question the validity of downstream inferences. We introduce cross-prediction: a method for valid inference powered by machine learning. With a small labeled dataset and a large unlabeled dataset, cross-prediction imputes the missing labels via machine learning and applies a form of debiasing to remedy the prediction inaccuracies. The resulting inferences achieve the desired error probability and are more powerful than those that only leverage the labeled data. Closely related is the recent proposal of prediction-powered inference [A. N. Angelopoulos, S. Bates, C. Fannjiang, M. I. Jordan, T. Zrnic, Science 382, 669-674 (2023)], which assumes that a good pretrained model is already available. We show that cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model. Finally, we observe that cross-prediction gives more stable conclusions than its competitors; its confidence intervals (CIs) typically have significantly lower variability.
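
The impute-then-debias idea described in the abstract can be illustrated with a minimal sketch for the simplest target, a population mean. The abstract gives no formulas, so everything below is an assumption made for illustration: the K-fold form of the estimator, the least-squares model, the helper names (`fit_linear`, `predict_linear`, `cross_prediction_mean`), and the plug-in normal interval, which ignores dependence between folds and is not the authors' variance estimator.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept (stand-in for any ML model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def cross_prediction_mean(X_lab, y_lab, X_unlab, n_folds=5, z=1.96, seed=0):
    """Illustrative cross-fitted estimate of E[Y] with a rough 95% interval."""
    rng = np.random.default_rng(seed)
    n, N = len(y_lab), len(X_unlab)
    folds = np.array_split(rng.permutation(n), n_folds)
    resid = np.empty(n)          # out-of-fold errors on labeled data
    unlab_pred = np.zeros(N)     # fold-averaged predictions on unlabeled data
    for idx in folds:
        train = np.setdiff1d(np.arange(n), idx)
        coef = fit_linear(X_lab[train], y_lab[train])
        # Debiasing term: each labeled point is scored by a model
        # that never saw it during training.
        resid[idx] = y_lab[idx] - predict_linear(coef, X_lab[idx])
        # Imputation term: average the fold models on the unlabeled data.
        unlab_pred += predict_linear(coef, X_unlab) / n_folds
    theta = unlab_pred.mean() + resid.mean()
    # Simplified plug-in standard error (assumption, see lead-in).
    se = np.sqrt(unlab_pred.var(ddof=1) / N + resid.var(ddof=1) / n)
    return theta, (theta - z * se, theta + z * se)

# Synthetic demo: small labeled set, large unlabeled set, true mean 2.0.
rng = np.random.default_rng(1)
n, N = 200, 20000
X_lab = rng.normal(size=(n, 1))
y_lab = 2.0 + 3.0 * X_lab[:, 0] + rng.normal(size=n)
X_unlab = rng.normal(size=(N, 1))
theta, (lo, hi) = cross_prediction_mean(X_lab, y_lab, X_unlab)
print(f"estimate={theta:.3f}, interval=({lo:.3f}, {hi:.3f})")
```

In this sketch every labeled point contributes both to training (in the other folds) and to the bias correction, which is the contrast with splitting off a fixed fraction of labeled data for training that the abstract draws.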

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/767b/11009639/04ddd296d5b2/pnas.2322083121fig01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/767b/11009639/f67080c7e7b6/pnas.2322083121fig02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/767b/11009639/a29ca85995e5/pnas.2322083121fig03.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/767b/11009639/29f3093fbabf/pnas.2322083121fig04.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/767b/11009639/a74e04495b7d/pnas.2322083121fig05.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/767b/11009639/d0737d3fe363/pnas.2322083121fig06.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/767b/11009639/e933fa2ce2c6/pnas.2322083121fig07.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/767b/11009639/f624c1f0138d/pnas.2322083121fig08.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/767b/11009639/5c89ef972d42/pnas.2322083121fig09.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/767b/11009639/34471182b5b7/pnas.2322083121fig10.jpg

Similar articles

1
Cross-prediction-powered inference.
Proc Natl Acad Sci U S A. 2024 Apr 9;121(15):e2322083121. doi: 10.1073/pnas.2322083121. Epub 2024 Apr 3.
2
Prediction-powered inference.
Science. 2023 Nov 10;382(6671):669-674. doi: 10.1126/science.adi6000. Epub 2023 Nov 9.
3
Weakly Semi-supervised phenotyping using Electronic Health records.
J Biomed Inform. 2022 Oct;134:104175. doi: 10.1016/j.jbi.2022.104175. Epub 2022 Sep 5.
4
Estimating the prevalence of diabetic retinopathy in electronic health records with massive missing labels.
Intell Based Med. 2024;10. doi: 10.1016/j.ibmed.2024.100154. Epub 2024 Jul 5.
5
CPSS: Fusing consistency regularization and pseudo-labeling techniques for semi-supervised deep cardiovascular disease detection using all unlabeled electrocardiograms.
Comput Methods Programs Biomed. 2024 Sep;254:108315. doi: 10.1016/j.cmpb.2024.108315. Epub 2024 Jul 4.
6
Methods for correcting inference based on outcomes predicted by machine learning.
Proc Natl Acad Sci U S A. 2020 Dec 1;117(48):30266-30275. doi: 10.1073/pnas.2001238117. Epub 2020 Nov 18.
7
Learning From Multiple Datasets With Heterogeneous and Partial Labels for Universal Lesion Detection in CT.
IEEE Trans Med Imaging. 2021 Oct;40(10):2759-2770. doi: 10.1109/TMI.2020.3047598. Epub 2021 Sep 30.
8
Statistical Learning and Inference Is Impaired in the Nonclinical Continuum of Psychosis.
J Neurosci. 2020 Aug 26;40(35):6759-6769. doi: 10.1523/JNEUROSCI.0315-20.2020. Epub 2020 Jul 20.
9
Comparison of Two Modern Survival Prediction Tools, SORG-MLA and METSSS, in Patients With Symptomatic Long-bone Metastases Who Underwent Local Treatment With Surgery Followed by Radiotherapy and With Radiotherapy Alone.
Clin Orthop Relat Res. 2024 Dec 1;482(12):2193-2208. doi: 10.1097/CORR.0000000000003185. Epub 2024 Jul 23.
10
Crowdsourcing for Machine Learning in Public Health Surveillance: Lessons Learned From Amazon Mechanical Turk.
J Med Internet Res. 2022 Jan 18;24(1):e28749. doi: 10.2196/28749.

Cited by

1
Artificial intelligence for modelling infectious disease epidemics.
Nature. 2025 Feb;638(8051):623-635. doi: 10.1038/s41586-024-08564-w. Epub 2025 Feb 19.
2
ipd: an R package for conducting inference on predicted data.
Bioinformatics. 2025 Feb 4;41(2). doi: 10.1093/bioinformatics/btaf055.

References

1
Cross-validation: what does it estimate and how well does it do it?
J Am Stat Assoc. 2024;119(546):1434-1445. doi: 10.1080/01621459.2023.2197686. Epub 2023 May 15.
2
Prediction-powered inference.
Science. 2023 Nov 10;382(6671):669-674. doi: 10.1126/science.adi6000. Epub 2023 Nov 9.
3
ChatGPT Chemistry Assistant for Text Mining and the Prediction of MOF Synthesis.
J Am Chem Soc. 2023 Aug 16;145(32):18048-18062. doi: 10.1021/jacs.3c05819. Epub 2023 Aug 7.
4
Evolutionary-scale prediction of atomic-level protein structure with a language model.
Science. 2023 Mar 17;379(6637):1123-1130. doi: 10.1126/science.ade2574. Epub 2023 Mar 16.
5
The structural context of posttranslational modifications at a proteome-wide scale.
PLoS Biol. 2022 May 16;20(5):e3001636. doi: 10.1371/journal.pbio.3001636. eCollection 2022 May.
6
Highly accurate protein structure prediction for the human proteome.
Nature. 2021 Aug;596(7873):590-596. doi: 10.1038/s41586-021-03828-1. Epub 2021 Jul 22.
7
A generalizable and accessible approach to machine learning with global satellite imagery.
Nat Commun. 2021 Jul 20;12(1):4392. doi: 10.1038/s41467-021-24638-z.
8
Highly accurate protein structure prediction with AlphaFold.
Nature. 2021 Aug;596(7873):583-589. doi: 10.1038/s41586-021-03819-2. Epub 2021 Jul 15.
9
Methods for correcting inference based on outcomes predicted by machine learning.
Proc Natl Acad Sci U S A. 2020 Dec 1;117(48):30266-30275. doi: 10.1073/pnas.2001238117. Epub 2020 Nov 18.
10
Satellite-based estimates reveal widespread forest degradation in the Amazon.
Glob Chang Biol. 2020 May;26(5):2956-2969. doi: 10.1111/gcb.15029. Epub 2020 Mar 6.