
A benchmarking framework and dataset for learning to defer in human-AI decision-making.

Authors

Alves Jean V, Leitão Diogo, Jesus Sérgio, Sampaio Marco O P, Liébana Javier, Saleiro Pedro, Figueiredo Mário A T, Bizarro Pedro

Affiliations

Feedzai, Coimbra, Portugal.

Instituto Superior Técnico, ULisboa, Lisboa, Portugal.

Publication information

Sci Data. 2025 Apr 23;12(1):506. doi: 10.1038/s41597-025-04664-y.

DOI: 10.1038/s41597-025-04664-y
PMID: 40268945
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12019285/
Abstract

Learning to Defer (L2D) algorithms improve human-AI collaboration by deferring decisions to human experts when they are likely to be more accurate than the AI model. These can be crucial in high-stakes tasks like fraud detection, where false negatives can cost victims their life savings. The primary challenge in training and evaluating these systems is the high cost of acquiring expert predictions, often leading to the use of simplistic simulated expert behavior in benchmarks. We introduce OpenL2D, a framework generating synthetic experts with adjustable decision-making processes and work capacity constraints for more realistic L2D testing. Applied to a public fraud detection dataset, OpenL2D creates the financial fraud alert review dataset (FiFAR), which contains predictions from 50 fraud analysts for 30 K instances. We show that FiFAR's synthetic experts are similar to real experts in metrics such as consistency and inter-expert agreement. Our L2D benchmark reveals that performance rankings of L2D algorithms vary significantly based on the available experts, highlighting the need to consider diverse expert behavior in L2D benchmarking.
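
The deferral idea described in the abstract (route an instance to a human expert when that expert is likely to be more accurate than the model, subject to each expert's work capacity) can be illustrated with a minimal sketch. This is not the OpenL2D implementation or any algorithm from the paper: the function name `assign_with_capacity`, the greedy strategy, the analyst names, and all probabilities are hypothetical, and the sketch assumes that per-instance estimates of each decision-maker's probability of being correct are already available.

```python
from typing import Dict, List

MODEL = "model"  # label used when the AI model keeps the decision


def assign_with_capacity(
    model_correct: List[float],
    expert_correct: Dict[str, List[float]],
    capacity: Dict[str, int],
) -> List[str]:
    """Greedy deferral under capacity constraints (illustrative only).

    model_correct[i]        : assumed P(model is correct on instance i)
    expert_correct[name][i] : assumed P(expert `name` is correct on instance i)
    capacity[name]          : maximum number of instances `name` can review
    """
    n = len(model_correct)
    assignment = [MODEL] * n
    remaining = dict(capacity)

    # Every (instance, expert) pair where the expert is expected to beat the model.
    candidates = [
        (probs[i] - model_correct[i], i, name)
        for name, probs in expert_correct.items()
        for i in range(n)
        if probs[i] > model_correct[i]
    ]
    # Handle the largest expected gains first.
    candidates.sort(key=lambda c: c[0], reverse=True)

    for _gain, i, name in candidates:
        # Defer only if the instance is still with the model
        # and the expert has review capacity left.
        if assignment[i] == MODEL and remaining.get(name, 0) > 0:
            assignment[i] = name
            remaining[name] -= 1
    return assignment


# Hypothetical inputs: four alerts, two analysts who can each review one alert.
model_correct = [0.9, 0.6, 0.5, 0.7]
expert_correct = {
    "analyst_a": [0.8, 0.9, 0.9, 0.6],
    "analyst_b": [0.7, 0.8, 0.95, 0.9],
}
capacity = {"analyst_a": 1, "analyst_b": 1}

print(assign_with_capacity(model_correct, expert_correct, capacity))
# → ['model', 'analyst_a', 'analyst_b', 'model']
```

With these made-up numbers, the last alert stays with the model even though `analyst_b` is estimated to be more accurate on it, because that analyst's single review slot is already used on a higher-gain alert. Capacity constraints can therefore change which expert, if any, receives each deferral.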


