Improving Consensus Scoring of Crowdsourced Data Using the Rasch Model: Development and Refinement of a Diagnostic Instrument

Author information

Brady Christopher John, Mudie Lucy Iluka, Wang Xueyang, Guallar Eliseo, Friedman David Steven

Affiliations

Dana Center for Preventive Ophthalmology, Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, United States.

Bloomberg School of Public Health, Department of Epidemiology, Johns Hopkins University, Baltimore, MD, United States.

Publication information

J Med Internet Res. 2017 Jun 20;19(6):e222. doi: 10.2196/jmir.7984.

DOI:10.2196/jmir.7984

PMID:28634154

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5497070/

Abstract

BACKGROUND

Diabetic retinopathy (DR) is a leading cause of vision loss in working-age individuals worldwide. While screening is effective and cost-effective, it remains underutilized, and novel methods are needed to increase detection of DR. This clinical validation study compared diagnostic gradings of retinal fundus photographs provided by volunteers on the Amazon Mechanical Turk (AMT) crowdsourcing marketplace with expert-provided gold-standard grading. It also explored whether regression methods could improve the consensus of crowdsourced classifications beyond a simple majority vote (MV).

OBJECTIVE

The aim of our study was to determine whether regression methods could be used to improve the consensus grading of data collected by crowdsourcing.

METHODS

A total of 1200 retinal images of individuals with diabetes mellitus from the Messidor public dataset were posted to AMT. Eligible crowdsourcing workers had at least 500 previously approved tasks with an approval rating of 99% across their prior submitted work. A total of 10 workers were recruited to classify each image as normal or abnormal. If half or more workers judged the image to be abnormal, the MV consensus grade was recorded as abnormal. Rasch analysis was then used to calculate worker ability scores in a random 50% training set, which were then used as weights in a regression model in the remaining 50% test set to determine if a more accurate consensus could be devised. Outcomes of interest were the percentage of correctly classified images, sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) for the consensus grade as compared with the expert grading provided with the dataset.
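The majority-vote rule and the accuracy outcomes described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's code: the function names `majority_vote` and `summarize`, the 0/1 encoding (1 = abnormal), and the toy votes are all hypothetical.

```python
from typing import Dict, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return 1 (abnormal) if half or more of the votes are 1, else 0 (normal).

    Ties count as abnormal, matching the "half or more" rule in the text.
    """
    return 1 if 2 * sum(votes) >= len(votes) else 0


def summarize(pred: Sequence[int], gold: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity of pred against gold labels.

    Assumes gold contains at least one abnormal and one normal image.
    """
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gold))
    tn = sum(p == 0 and g == 0 for p, g in zip(pred, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gold))
    return {
        "accuracy": (tp + tn) / len(gold),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical toy data: 4 images, 10 worker votes each (1 = abnormal).
toy_votes = [
    [1] * 7 + [0] * 3,  # clear abnormal majority
    [1] * 5 + [0] * 5,  # exact tie, graded abnormal by the rule
    [1] * 2 + [0] * 8,  # clear normal majority
    [1] * 4 + [0] * 6,  # just under half, graded normal
]
toy_gold = [1, 0, 0, 1]  # made-up expert grades

consensus = [majority_vote(v) for v in toy_votes]
metrics = summarize(consensus, toy_gold)
print(consensus, metrics)
```

The tie-breaking direction matters for a screening task: counting a 5-5 split as abnormal trades some specificity for sensitivity.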

RESULTS

Using MV grading, the consensus was correct in 75.5% of images (906/1200), with 75.5% sensitivity, 75.5% specificity, and an AUROC of 0.75 (95% CI 0.73-0.78). A logistic regression model using Rasch-weighted individual scores generated an AUROC of 0.91 (95% CI 0.88-0.93) compared with 0.89 (95% CI 0.86-0.92) for a model using unweighted scores (chi-square P<.001). Setting a diagnostic cut-point to optimize sensitivity at 90%, 77.5% (465/600) were graded correctly, with 90.3% sensitivity, 68.5% specificity, and an AUROC of 0.79 (95% CI 0.76-0.83).
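Choosing a cut-point to reach a target sensitivity can be illustrated as follows. This is a sketch under assumptions: the abstract does not state how the cut-point was selected, so the rule used here (the highest threshold whose sensitivity meets the target) is one plausible choice, and `pick_cutpoint` and the toy scores are hypothetical.

```python
from typing import Sequence


def pick_cutpoint(scores: Sequence[float], gold: Sequence[int],
                  target_sensitivity: float = 0.90) -> float:
    """Return the highest threshold t such that grading score >= t as
    abnormal reaches at least the target sensitivity.

    Assumes gold (1 = abnormal) contains at least one abnormal image.
    """
    positives = sum(gold)
    # Walk candidate thresholds from strictest to most lenient.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, g in zip(scores, gold) if g == 1 and s >= t)
        if tp / positives >= target_sensitivity:
            return t
    raise ValueError("no threshold reaches the target sensitivity")


# Hypothetical model scores (e.g. predicted probability of abnormal)
# and made-up gold labels for 8 images.
toy_scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.35, 0.20, 0.10]
toy_gold = [1, 1, 0, 1, 1, 0, 0, 0]

cut = pick_cutpoint(toy_scores, toy_gold, target_sensitivity=0.90)
negatives = [s for s, g in zip(toy_scores, toy_gold) if g == 0]
specificity = sum(1 for s in negatives if s < cut) / len(negatives)
print(cut, specificity)
```

In the toy data, lowering the threshold far enough to capture every abnormal image also admits one normal image, which mirrors the sensitivity-for-specificity trade reported above.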

CONCLUSIONS

Crowdsourced interpretations of retinal images provide rapid and accurate results as compared with a gold-standard grading. Creating a logistic regression model using Rasch analysis to weight crowdsourced classifications by worker ability improves accuracy of aggregated grades as compared with simple majority vote.
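The idea of weighting classifications by worker ability can be sketched as below. Note the loud assumption: this is not a Rasch estimation. As a stand-in, each worker's ability is approximated by the smoothed log-odds of their accuracy on a training set, and the weighted fraction of abnormal votes replaces the study's logistic regression. All names (`ability_logit`, `weighted_score`) and numbers are hypothetical.

```python
import math
from typing import Dict


def ability_logit(n_correct: int, n_total: int) -> float:
    """Smoothed log-odds of training-set accuracy.

    A crude proxy for an ability score on the logit scale; NOT a fitted
    Rasch parameter. The +0.5 / +1.0 smoothing avoids infinite logits.
    """
    p = (n_correct + 0.5) / (n_total + 1.0)
    return math.log(p / (1.0 - p))


def weighted_score(votes: Dict[str, int], abilities: Dict[str, float]) -> float:
    """Ability-weighted fraction of abnormal (1) votes, in [0, 1]."""
    # Clip at zero so workers at or below chance contribute nothing.
    weights = {w: max(abilities[w], 0.0) for w in votes}
    total = sum(weights.values())
    if total == 0:
        # Fall back to the unweighted fraction if no worker beats chance.
        return sum(votes.values()) / len(votes)
    return sum(weights[w] * votes[w] for w in votes) / total


# Hypothetical training record: worker -> (images graded correctly, images graded).
training = {"w1": (45, 50), "w2": (30, 50), "w3": (26, 50)}
abilities = {w: ability_logit(c, n) for w, (c, n) in training.items()}

# One test image: the strongest worker says abnormal, the other two say normal.
votes = {"w1": 1, "w2": 0, "w3": 0}
unweighted = sum(votes.values()) / len(votes)
weighted = weighted_score(votes, abilities)
print(round(unweighted, 3), round(weighted, 3))
```

Here the unweighted vote grades the image normal, while the weighted score crosses 0.5 because the dissenting worker was far more accurate in training. Whether that helps in practice depends on how well training-set ability generalizes to the test set.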

