
Gamified Crowdsourcing as a Novel Approach to Lung Ultrasound Data Set Labeling: Prospective Analysis.

Affiliations

Department of Emergency Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, United States.

Centaur Labs, Boston, MA, United States.

Publication information

J Med Internet Res. 2024 Jul 4;26:e51397. doi: 10.2196/51397.

Abstract

BACKGROUND

Machine learning (ML) models can yield faster and more accurate medical diagnoses; however, developing ML models is limited by a lack of high-quality labeled training data. Crowdsourced labeling is a potential solution but can be constrained by concerns about label quality.

OBJECTIVE

This study aims to examine whether a gamified crowdsourcing platform with continuous performance assessment, user feedback, and performance-based incentives could produce expert-quality labels on medical imaging data.

METHODS

In this diagnostic comparison study, 2384 lung ultrasound clips were retrospectively collected from 203 emergency department patients. A total of 6 lung ultrasound experts classified 393 of these clips as having no B-lines, one or more discrete B-lines, or confluent B-lines to create 2 reference standard data sets: a training set of 195 clips and a test set of 198 clips. These sets were used, respectively, to (1) train users on a gamified crowdsourcing platform and (2) compare the concordance of the resulting crowd labels with the concordance of individual experts against the reference standard. Crowd opinions were sourced from users of the DiagnosUs (Centaur Labs) iOS app over 8 days, filtered based on past performance, aggregated by majority rule, and analyzed for label concordance against the hold-out test set of expert-labeled clips. The primary outcome was the labeling concordance of the aggregated crowd opinions, compared with that of trained experts, in classifying B-lines on lung ultrasound clips.
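The aggregation described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the clip identifiers and label names are hypothetical, but the logic — majority vote per clip, then fraction of clips matching the expert reference — follows the method as stated.

```python
from collections import Counter

# Hypothetical label vocabulary for the three B-line classes
LABELS = ("no_b_lines", "discrete_b_lines", "confluent_b_lines")

def majority_label(opinions):
    """Return the most common label among a clip's crowd opinions."""
    return Counter(opinions).most_common(1)[0][0]

def concordance(crowd_labels, reference_labels):
    """Fraction of clips whose crowd label matches the reference standard."""
    matches = sum(
        crowd_labels[clip] == ref for clip, ref in reference_labels.items()
    )
    return matches / len(reference_labels)

# Toy example with three clips (invented data, for illustration only)
opinions_by_clip = {
    "clip_1": ["no_b_lines"] * 5 + ["discrete_b_lines"] * 2,
    "clip_2": ["discrete_b_lines"] * 4 + ["confluent_b_lines"] * 3,
    "clip_3": ["confluent_b_lines"] * 6 + ["no_b_lines"],
}
crowd = {c: majority_label(ops) for c, ops in opinions_by_clip.items()}
reference = {
    "clip_1": "no_b_lines",
    "clip_2": "discrete_b_lines",
    "clip_3": "confluent_b_lines",
}
print(concordance(crowd, reference))  # 1.0
```

In the study itself, the crowd side of this comparison additionally filtered opinions by each user's past performance before the vote, a step omitted here for brevity.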

RESULTS

Our clinical data set included patients with a mean age of 60.0 (SD 19.0) years; 105 (51.7%) patients were female and 114 (56.1%) patients were White. Over the 195 training clips, the expert-consensus label distribution was 114 (58%) no B-lines, 56 (29%) discrete B-lines, and 25 (13%) confluent B-lines. Over the 198 test clips, the expert-consensus label distribution was 138 (70%) no B-lines, 36 (18%) discrete B-lines, and 24 (12%) confluent B-lines. In total, 99,238 opinions were collected from 426 unique users. On the test set of 198 clips, the mean labeling concordance of individual experts relative to the reference standard was 85.0% (SE 2.0), compared with 87.9% crowdsourced label concordance (P=.15). When individual experts' opinions were compared with reference standard labels created by majority vote excluding their own opinion, crowd concordance was higher than the mean concordance of individual experts to reference standards (87.4% vs 80.8%, SE 1.6 for expert concordance; P<.001). Clips with discrete B-lines generated the most disagreement with the expert consensus, both for the crowd consensus and for individual experts. Using randomly sampled subsets of crowd opinions, 7 quality-filtered opinions were sufficient to achieve near the maximum crowd concordance.
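The subset analysis in the final sentence — estimating concordance as a function of the number of opinions per clip — can be sketched as follows. This is a hedged illustration of the general technique (repeated random subsampling with a majority vote), not the authors' analysis code; the function name, trial count, and toy data are all invented.

```python
import random
from collections import Counter

def sampled_concordance(opinions_by_clip, reference, k, trials=200, seed=0):
    """Mean concordance with the reference when only k randomly
    sampled opinions vote per clip, averaged over repeated trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        matches = 0
        for clip, ref in reference.items():
            sample = rng.sample(opinions_by_clip[clip], k)
            vote = Counter(sample).most_common(1)[0][0]
            matches += vote == ref
        total += matches / len(reference)
    return total / trials

# Toy data (invented): 10 opinions per clip, mostly agreeing with the reference
opinions = {
    "clip_a": ["no_b_lines"] * 9 + ["discrete_b_lines"],
    "clip_b": ["discrete_b_lines"] * 8 + ["no_b_lines"] * 2,
}
reference = {"clip_a": "no_b_lines", "clip_b": "discrete_b_lines"}

# Concordance generally rises toward the full-crowd value as k grows
print(sampled_concordance(opinions, reference, k=3))
```

Sweeping k over such a curve is how one would identify the point, reported in the study as 7 quality-filtered opinions, where additional opinions stop improving concordance.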

CONCLUSIONS

Crowdsourced labels for B-line classification on lung ultrasound clips via a gamified approach achieved expert-level accuracy. This suggests a strategic role for gamified crowdsourcing in efficiently generating labeled image data sets for training ML systems.

