Department of Surgery, NYU Langone Hospital Long Island, Mineola, New York.
NYU Kimmel Hyperbaric and Advanced Wound Healing Center, New York, New York.
JAMA Netw Open. 2021 May 3;4(5):e217234. doi: 10.1001/jamanetworkopen.2021.7234.
IMPORTANCE: Accurate assessment of wound area and percentage of granulation tissue (PGT) is important for optimizing wound care and healing outcomes. Artificial intelligence (AI)-based wound assessment tools have the potential to improve the accuracy and consistency of wound area and PGT measurement while also improving the efficiency of wound care workflows.
OBJECTIVE: To develop a quantitative and qualitative method for evaluating AI-based wound assessment tools against expert human assessments.
DESIGN, SETTING, AND PARTICIPANTS: This diagnostic study was performed across 2 independent wound centers using deidentified wound photographs collected for routine care (site 1, 110 photographs taken between May 1 and 31, 2018; site 2, 89 photographs taken between January 1 and December 31, 2019). Digital wound photographs of patients were selected chronologically from the electronic medical records from the general population of patients visiting the wound centers. For inclusion in the study, the complete wound edge and a ruler were required to be visible; circumferential ulcers were specifically excluded. Four wound specialists (2 per site) and an AI-based wound assessment service independently traced wound area and granulation tissue.
MAIN OUTCOMES AND MEASURES: The quantitative performance of AI tracings was evaluated by statistically comparing the distributions of error measures between test AI traces and reference human traces (AI vs human) with the distributions of the same error measures between independent traces by 2 humans (human vs human). Quantitative outcomes included statistically significant differences in the error measures of false-negative area (FNA), false-positive area (FPA), and absolute relative error (ARE) between AI vs human and human vs human comparisons of wound area and granulation tissue tracings. Six masked attending physician reviewers (3 per site) viewed randomized area tracings from AI and human annotators and assessed them qualitatively. Qualitative outcomes included statistically significant differences between the absolute difference of AI-based PGT measurements from mean reviewer visual PGT estimates and the measures of PGT estimate variability (ie, range, standard deviation) across reviewers.
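The abstract names the three error measures but does not give their formulas. The following is a minimal sketch under stated assumptions: each tracing is represented as a set of pixel coordinates, FNA is taken as the reference area missed by the test trace, FPA as the test area lying outside the reference, and ARE as the absolute difference in total area, each expressed as a fraction of the reference area. The function names `trace_errors` and `rectangle` and the toy tracings are hypothetical and are not taken from the study.

```python
def trace_errors(reference, test):
    """Compare a test tracing with a reference tracing.

    Both tracings are sets of (row, col) pixel coordinates.
    All three measures are normalized by the reference area
    (an assumption; the abstract does not state the normalization).
    """
    if not reference:
        raise ValueError("reference tracing must cover at least one pixel")
    ref_area = len(reference)
    fna = len(reference - test) / ref_area        # reference pixels the test trace missed
    fpa = len(test - reference) / ref_area        # test pixels outside the reference
    are = abs(len(test) - ref_area) / ref_area    # absolute relative error in total area
    return {"FNA": fna, "FPA": fpa, "ARE": are}


def rectangle(r0, r1, c0, c1):
    """Hypothetical helper: filled rectangle of pixels, end indices exclusive."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Toy example: a 10 x 10 reference and a test trace shifted and narrowed by one column.
reference = rectangle(0, 10, 0, 10)   # 100 pixels
test = rectangle(0, 10, 2, 11)        # 90 pixels, columns 2 through 10
errors = trace_errors(reference, test)
print(errors)
```

Under these assumptions, computing the measures once with the AI trace as `test` and once with a second human trace as `test` yields the two error distributions (AI vs human, human vs human) that the study compares.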
RESULTS: A total of 199 photographs were selected for the study across both sites; mean (SD) patient age was 64 (18) years (range, 17-95 years) and 127 (63.8%) were women. The differences between AI vs human and human vs human comparisons for FPA and ARE were not statistically significant. AI vs human FNA was slightly elevated compared with human vs human FNA (median [IQR], 7.7% [2.7%-21.2%] vs 5.7% [1.6%-14.9%]; P < .001), indicating that AI traces tended to slightly underestimate the human reference wound boundaries compared with human test traces. Two of 6 reviewers agreed statistically significantly more often that human tracings met the standard area definition, but overall agreement was moderate (352 yes responses of 583 total responses [60.4%] for AI tracings and 793 yes responses of 1166 total responses [68.0%] for human tracings). AI PGT measurements fell within the typical range of variation in interreviewer visual PGT estimates; however, visual PGT estimates varied considerably (mean range, 34.8%; mean SD, 19.6%).
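The PGT comparison described above can be illustrated for a single photograph. This is a sketch only: the reviewer estimates and the AI value are made-up numbers, the function name `pgt_comparison` is hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not specify how variability was computed.

```python
from statistics import mean, stdev


def pgt_comparison(ai_pgt, reviewer_estimates):
    """Relate an AI PGT measurement to reviewer visual estimates (all in percent).

    Returns the absolute difference between the AI value and the mean
    reviewer estimate, plus two variability measures across reviewers.
    """
    avg = mean(reviewer_estimates)
    return {
        "abs_diff": abs(ai_pgt - avg),
        "range": max(reviewer_estimates) - min(reviewer_estimates),
        "sd": stdev(reviewer_estimates),  # sample SD (assumed)
    }


# Hypothetical photograph: three reviewers and one AI measurement.
result = pgt_comparison(55.0, [40.0, 50.0, 60.0])
print(result)

# The AI value is within interreviewer variation if its distance from the
# reviewer mean does not exceed the spread among the reviewers themselves.
within_range = result["abs_diff"] <= result["range"]
within_sd = result["abs_diff"] <= result["sd"]
```

Repeating this per photograph gives paired values (AI difference vs reviewer spread) whose distributions can then be compared statistically.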
CONCLUSIONS AND RELEVANCE: This study provides a framework for evaluating AI-based digital wound assessment tools that can be extended to automated measurements of other wound features or adapted to evaluate other AI-based digital image diagnostic tools. As AI-based wound assessment tools become more common across wound care settings, it will be important to rigorously validate their performance in helping clinicians obtain accurate wound assessments to guide clinical care.