Kentley Jonathan, Weber Jochen, Liopyris Konstantinos, Braun Ralph P, Marghoob Ashfaq A, Quigley Elizabeth A, Nelson Kelly, Prentice Kira, Duhaime Erik, Halpern Allan C, Rotemberg Veronica
Department of Dermatology, Chelsea and Westminster Hospital, London, United Kingdom.
Dermatology Section, Memorial Sloan Kettering Cancer Center, New York, NY, United States.
JMIR Med Inform. 2023 Jan 18;11:e38412. doi: 10.2196/38412.
Dermoscopy is commonly used for the evaluation of pigmented lesions, but agreement between experts on the identification of dermoscopic structures is known to be relatively poor. Expert labeling of medical data is a bottleneck in the development of machine learning (ML) tools, and crowdsourcing has been demonstrated as a cost- and time-efficient method for the annotation of medical images.
The aim of this study is to demonstrate that crowdsourcing can be used to label basic dermoscopic structures in images of pigmented lesions with reliability similar to that of a group of experts.
First, we obtained labels for 248 images of melanocytic lesions, with 31 dermoscopic "subfeatures" annotated by 20 dermoscopy experts. Because interrater reliability (IRR) for the subfeatures was low, they were collapsed into 6 dermoscopic "superfeatures" based on structural similarity: dots, globules, lines, network structures, regression structures, and vessels. These images were then used as the gold standard for the crowd study. The commercial platform DiagnosUs was used to obtain annotations from a nonexpert crowd for the presence or absence of the 6 superfeatures in each of the 248 images. We replicated this methodology with a group of 7 dermatologists to allow direct comparison with the nonexpert crowd. The Cohen κ value was used to measure agreement across raters.
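The abstract names Cohen κ as the agreement metric but does not spell out how the per-feature medians were computed; a plausible reading is the median of pairwise κ values across raters. The sketch below illustrates that computation with synthetic presence/absence labels; the function name and the random data are illustrative assumptions, not the study's code or data.

```python
# Minimal sketch: median pairwise Cohen kappa across raters for one
# superfeature. Synthetic data only; not the study's actual ratings.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score


def median_pairwise_kappa(ratings: np.ndarray) -> float:
    """ratings: (n_raters, n_images) binary presence/absence matrix.

    Computes Cohen kappa for every pair of raters and returns the median.
    """
    kappas = [
        cohen_kappa_score(ratings[i], ratings[j])
        for i, j in combinations(range(ratings.shape[0]), 2)
    ]
    return float(np.median(kappas))


# Example: 7 raters x 248 images of random binary labels.
rng = np.random.default_rng(0)
synthetic = rng.integers(0, 2, size=(7, 248))
print(median_pairwise_kappa(synthetic))
```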
In total, we obtained 139,731 ratings of the 6 dermoscopic superfeatures from the crowd. Agreement was relatively low for the identification of dots and globules (median κ values of 0.526 and 0.395, respectively), whereas network structures and vessels showed the highest agreement (median κ values of 0.581 and 0.798, respectively). The expert raters showed the same pattern, with median κ values of 0.483 and 0.517 for dots and globules, respectively, and 0.758 and 0.790 for network structures and vessels. The median κ values between nonexperts and thresholded average-expert readers were 0.709 for dots, 0.719 for globules, 0.714 for lines, 0.838 for network structures, 0.818 for regression structures, and 0.728 for vessels.
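The abstract does not define the "thresholded average-expert reader" precisely; one common construction, assumed here, is a majority vote: average each image's binary expert ratings and threshold at 0.5, then compare that consensus label against the crowd's consensus with Cohen κ. The cutoff value and the synthetic rating matrices below are assumptions for illustration.

```python
# Hedged sketch of the crowd-vs-expert comparison, assuming a simple
# majority-vote thresholding rule. Synthetic data only.
import numpy as np
from sklearn.metrics import cohen_kappa_score


def threshold_average_reader(ratings: np.ndarray, cutoff: float = 0.5) -> np.ndarray:
    """Collapse an (n_raters, n_images) binary matrix into one consensus
    label per image by thresholding the mean rating at `cutoff`."""
    return (ratings.mean(axis=0) >= cutoff).astype(int)


rng = np.random.default_rng(1)
expert = rng.integers(0, 2, size=(7, 248))   # 7 dermatologists (synthetic)
crowd = rng.integers(0, 2, size=(50, 248))   # nonexpert crowd (synthetic)

kappa = cohen_kappa_score(
    threshold_average_reader(expert),
    threshold_average_reader(crowd),
)
print(f"crowd vs thresholded average-expert kappa: {kappa:.3f}")
```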
This study confirmed that IRR varied across dermoscopic features among a group of experts; a similar pattern was observed in a nonexpert crowd. Agreement between the crowd and the experts was good or excellent for each of the 6 superfeatures, indicating that the crowd labels dermoscopic images with reliability comparable to that of experts. This supports the feasibility and dependability of crowdsourcing as a scalable solution for annotating large sets of dermoscopic images, with several potential clinical and educational applications, including the development of novel, explainable ML tools.