Cu Cassandra W, Dundas Nicole E, Heintz Timothy, Sheikh Zahida A, Alonso-Bermudez Bianca, Walker Jasmine, Wooten Avery, Badathala Anusha, Chapman Allyson, Ehie Odinakachukwu, Raghunathan Karthik, Mills Hunter, Espejo Edie, Boscardin John, Wallace Arthur W, Cobert Julien
School of Medicine, Tufts University School of Medicine, Boston, MA, USA.
UC Berkeley Department of Bioengineering, Berkeley, CA, USA.
NPJ Digit Med. 2025 Oct 3;8(1):595. doi: 10.1038/s41746-025-01975-7.
Skin tone assessments are critical for evaluating the fairness of healthcare algorithms (e.g., pulse oximetry), but the assessments themselves lack validation. Using prospectively collected facial images from 90 hospitalized adults at the San Francisco VA, three independent annotators rated facial regions in triplicate using the Fitzpatrick (I-VI) and Monk (1-10) skin tone scales. Patients also self-identified their skin tone. Annotator confidence was recorded on 5-point Likert scales. Across 810 images from 90 patients (9 images each), intra-annotator agreement was high, but inter-annotator agreement was moderate to low. Annotators frequently rated patients as darker when patients self-identified as lighter, and as lighter when patients self-identified as darker. In linear mixed-effects models controlling for facial region and annotator confidence, darker self-reported skin tones were associated with lighter annotator scores. These findings highlight the difficulty of labeling skin tone consistently and suggest that current methods for assessing representation in studies of biosensor-based algorithms may be influenced by labeling bias.
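The abstract reports inter-annotator agreement on ordinal skin tone scales but does not name the statistic used. As one illustrative option only, and not necessarily the authors' method, the sketch below computes a quadratic-weighted Cohen's kappa between two annotators. The function name `weighted_kappa` and the example ratings are hypothetical and are not taken from the study.

```python
from collections import Counter


def weighted_kappa(ratings_a, ratings_b, categories):
    """Quadratic-weighted Cohen's kappa for two raters on an ordinal scale.

    `categories` lists every scale level in order (e.g. 1..10 for Monk).
    This is an illustrative choice of statistic, not the study's stated method.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint counts and per-rater marginal counts.
    observed = Counter((index[a], index[b]) for a, b in zip(ratings_a, ratings_b))
    marg_a = Counter(index[a] for a in ratings_a)
    marg_b = Counter(index[b] for b in ratings_b)

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = ((i - j) / (k - 1)) ** 2  # quadratic penalty for distance
            num += w * observed[(i, j)] / n
            den += w * marg_a[i] * marg_b[j] / (n * n)
    if den == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - num / den


# Hypothetical Monk-scale ratings of six images by two annotators.
monk_levels = list(range(1, 11))
annotator_1 = [2, 3, 5, 7, 8, 4]
annotator_2 = [3, 3, 6, 6, 9, 4]
kappa = weighted_kappa(annotator_1, annotator_2, monk_levels)
print(f"quadratic-weighted kappa: {kappa:.3f}")
```

Quadratic weights penalize a two-level disagreement more heavily than a one-level disagreement, which suits ordered scales such as Fitzpatrick and Monk. With three annotators, the function could be applied to each pair, or a multi-rater statistic could be used instead.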