Calibrating a Transformer-Based Model's Confidence on Community-Engaged Research Studies: Decision Support Evaluation Study.

Authors

Ferrell Brian, Raskin Sarah E, Zimmerman Emily B

Affiliations

Virginia Commonwealth University, Richmond, VA, United States.

L. Douglas Wilder School of Government and Public Affairs, Virginia Commonwealth University, Richmond, VA, United States.

Publication information

JMIR Form Res. 2023 Mar 20;7:e41516. doi: 10.2196/41516.

DOI:10.2196/41516
PMID:36939830
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10131979/
Abstract

BACKGROUND

Deep learning offers great benefits in classification tasks such as medical imaging diagnostics or stock trading, especially when compared with human-level performances, and can be a viable option for classifying distinct levels within community-engaged research (CEnR). CEnR is a collaborative approach between academics and community partners with the aim of conducting research that is relevant to community needs while incorporating diverse forms of expertise. In the field of deep learning and artificial intelligence (AI), training multiple models to obtain the highest validation accuracy is common practice; however, it can overfit toward that specific data set and not generalize well to a real-world population, which creates issues of bias and potentially dangerous algorithmic decisions. Consequently, if we plan on automating human decision-making, there is a need for creating techniques and exhaustive evaluative processes for these powerful unexplainable models to ensure that we do not incorporate and blindly trust poor AI models to make real-world decisions.

OBJECTIVE

We aimed to conduct an evaluation study to see whether our most accurate transformer-based models derived from previous studies could emulate our own classification spectrum for tracking CEnR studies as well as whether the use of calibrated confidence scores was meaningful.

METHODS

We compared the classifications made by 3 domain experts for a sample of 45 studies drawn from our university's institutional review board database with those made by 3 previously trained transformer-based models. We also investigated whether calibrated confidence scores can be a viable technique for using AI in a support role for complex decision-making systems.
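The abstract does not say how the confidence scores were calibrated. As an illustration only, the sketch below uses temperature scaling, one common calibration technique, which may or may not be what the authors used. The logits, labels, candidate grid, and function names are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the temperature that minimizes NLL on held-out data (simple grid search)."""
    if candidates is None:
        candidates = [0.5 + 0.1 * i for i in range(46)]  # roughly 0.5 to 5.0
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))

# Hypothetical held-out logits for a 3-class problem (not data from the study).
# The model is very confident on every item but is right on only 3 of 4.
val_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
val_labels = [0, 1, 0, 0]

T = fit_temperature(val_logits, val_labels)
raw_conf = max(softmax(val_logits[0]))     # confidence before calibration
cal_conf = max(softmax(val_logits[0], T))  # confidence after calibration
print(f"T={T:.1f} raw={raw_conf:.3f} calibrated={cal_conf:.3f}")
```

A fitted temperature above 1 softens the probabilities, so the reported confidence moves closer to the accuracy observed on the held-out items. The predicted class itself does not change, because dividing the logits by a positive constant preserves their order.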

RESULTS

Our findings reveal that certain models exhibit an overestimation of their performance through high confidence scores, despite not achieving the highest validation accuracy.
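One way to quantify this kind of overestimation is to compare a model's stated confidence with how often it agrees with the reference labels. The sketch below computes a simple overconfidence gap and the expected calibration error (ECE). These are standard metrics chosen for illustration, not necessarily the ones reported in the study, and the confidence values and correctness flags are invented.

```python
def overconfidence_gap(confidences, correct):
    """Mean confidence minus accuracy; positive values indicate overconfidence."""
    n = len(confidences)
    return sum(confidences) / n - sum(correct) / n

def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; a confidence of exactly 0 goes into the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - accuracy)
    return ece

# Hypothetical predictions: 1 means the model matched the expert label, 0 means it did not.
confidences = [0.95, 0.90, 0.92, 0.97, 0.60]
correct = [1, 0, 1, 0, 1]

gap = overconfidence_gap(confidences, correct)
ece = expected_calibration_error(confidences, correct)
print(f"gap={gap:.3f} ece={ece:.3f}")
```

In this made-up example the model reports high confidence on items it gets wrong, so both numbers are well above zero. A model whose confidence tracked its accuracy would score close to zero on both.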

CONCLUSIONS

Future studies should be conducted with larger sample sizes to generalize the results more effectively. Although our study addresses the concerns of bias and overfitting in deep learning models, there is a need to further explore methods that allow domain experts to trust our models more. The use of a calibrated confidence score can be a misleading metric when determining our AI model's level of competency.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5227/10131979/da38a8e028fd/formative_v7i1e41516_fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5227/10131979/ec7cd5377959/formative_v7i1e41516_fig1.jpg
