Division of Orbital and Ophthalmic Plastic Surgery, Stein Eye Institute, University of California, Los Angeles.
Doheny Eye Institute, University of California, Los Angeles, Los Angeles.
Ophthalmic Plast Reconstr Surg. 2020 Mar/Apr;36(2):178-181. doi: 10.1097/IOP.0000000000001515.
To determine whether crowdsourced ratings of oculoplastic surgical outcomes provide reliable information compared with ratings from professional graders and oculoplastic experts.
In this prospective psychometric evaluation, a scale for rating postoperative eyelid swelling was constructed from randomly selected images with input from topic experts. This scale was presented adjacent to 205 test images, including 10% duplicates. Graders were instructed to match each test image to the reference image it most closely resembled. Three sets of graders were solicited: crowdsourced lay people from the Amazon Mechanical Turk marketplace, professional graders from the Doheny Image Reading Center (DIRC), and American Society of Ophthalmic Plastic and Reconstructive Surgery surgeons. Performance was assessed by classical correlational analysis and generalizability theory.
The correlation between scores on the first and second ratings of the 19 repeated images was 0.60 for lay observers, 0.80 for DIRC graders, and 0.84 for oculoplastic experts. In terms of inter-group rating reliability across all photos, the scores provided by lay observers correlated with those of DIRC graders at r = 0.88 and with those of experts at r = 0.79. The pictures themselves accounted for the greatest share of variation in all groups. The share of variation in the scores attributable to the rater was highest in the lay group at 25%, and was 20% and 21% for DIRC graders and experts, respectively.
Crowdsourced observers are insufficiently precise to replicate the results of experts in grading postoperative eyelid swelling. DIRC graders performed similarly to experts and present a less resource-intensive option.
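As an illustration of the two analyses named above, the following is a minimal sketch, not the authors' actual procedure. It assumes a fully crossed images-by-raters design with one rating per cell, and the function names (`pearson`, `variance_shares`) and the toy data are hypothetical. The first function computes a test-retest correlation of the kind reported for the repeated images; the second estimates the share of score variance attributable to images, raters, and residual error, as in a simple one-facet generalizability study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def variance_shares(scores):
    """Variance shares for a crossed images x raters design.

    scores[i][j] is the rating given to image i by rater j
    (assumption: every rater scores every image exactly once).
    Returns the proportion of total variance due to image,
    rater, and residual (interaction plus error).
    """
    n_i = len(scores)
    n_r = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_i * n_r)
    img_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(scores[i][j] for i in range(n_i)) / n_i
                   for j in range(n_r)]

    # Mean squares from a two-way ANOVA without replication.
    ms_img = n_r * sum((m - grand) ** 2 for m in img_means) / (n_i - 1)
    ms_rater = n_i * sum((m - grand) ** 2 for m in rater_means) / (n_r - 1)
    ss_res = sum(
        (scores[i][j] - img_means[i] - rater_means[j] + grand) ** 2
        for i in range(n_i) for j in range(n_r)
    )
    ms_res = ss_res / ((n_i - 1) * (n_r - 1))

    # Expected-mean-square estimates; negative estimates are set to zero.
    var_img = max((ms_img - ms_res) / n_r, 0.0)
    var_rater = max((ms_rater - ms_res) / n_i, 0.0)
    var_res = ms_res
    total = var_img + var_rater + var_res
    if total == 0:
        raise ValueError("all scores are identical; no variance to partition")
    return {
        "image": var_img / total,
        "rater": var_rater / total,
        "residual": var_res / total,
    }


# Toy data (hypothetical): three images rated by two raters,
# where the second rater scores every image one point higher.
toy_scores = [[1, 2], [2, 3], [3, 4]]
shares = variance_shares(toy_scores)
retest_r = pearson([1, 2, 3], [2, 4, 6])
```

In this toy example the systematic offset between the two raters shows up as rater variance rather than residual error, which is the kind of distinction the variance percentages in the results rely on.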