Clinical validation of deep learning algorithms for radiotherapy targeting of non-small-cell lung cancer: an observational study.

Affiliations

Artificial Intelligence in Medicine Program, Mass General Brigham, Harvard Medical School, Boston, MA, USA; Department of Radiation Oncology, Brigham and Women's Hospital and Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA.

Artificial Intelligence in Medicine Program, Mass General Brigham, Harvard Medical School, Boston, MA, USA; Department of Radiation Oncology, Brigham and Women's Hospital and Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA; Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA.

Publication information

Lancet Digit Health. 2022 Sep;4(9):e657-e666. doi: 10.1016/S2589-7500(22)00129-7.

Abstract

BACKGROUND

Artificial intelligence (AI) and deep learning have shown great potential in streamlining clinical tasks. However, most studies remain confined to in silico validation in small internal cohorts, without external validation or data on real-world clinical utility. We developed a strategy for the clinical validation of deep learning models for segmenting primary non-small-cell lung cancer (NSCLC) tumours and involved lymph nodes in CT images, which is a time-intensive step in radiation treatment planning, with large variability among experts.

METHODS

In this observational study, CT images and segmentations were collected from eight internal and external sources from the USA, the Netherlands, Canada, and China, with patients from the Maastro and Harvard-RT1 datasets used for model discovery (segmented by a single expert). Validation consisted of interobserver and intraobserver benchmarking, primary validation, functional validation, and end-user testing on the following datasets: multi-delineation, Harvard-RT1, Harvard-RT2, RTOG-0617, NSCLC-radiogenomics, Lung-PET-CT-Dx, RIDER, and thorax phantom. Primary validation consisted of stepwise testing on increasingly external datasets using measures of overlap including volumetric dice (VD) and surface dice (SD). Functional validation explored dosimetric effect, model failure modes, test-retest stability, and accuracy. End-user testing with eight experts assessed automated segmentations in a simulated clinical setting.
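For readers unfamiliar with the two overlap measures, the sketch below illustrates one common way to compute them from binary voxel masks with numpy and scipy. It is a minimal illustration, not the study's implementation: the helper names (volumetric_dice, surface_dice), the 2.0 mm tolerance, and the isotropic voxel-spacing default are assumptions made for the example.

import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def volumetric_dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Volumetric Dice: 2*|A ∩ B| / (|A| + |B|) over binary voxel masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0

def surface_dice(pred: np.ndarray, truth: np.ndarray,
                 tolerance_mm: float = 2.0,
                 spacing=(1.0, 1.0, 1.0)) -> float:
    """Surface Dice at a tolerance: the fraction of each contour's surface
    voxels lying within tolerance_mm of the other contour's surface."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    # Surface voxels = mask minus its one-voxel erosion.
    pred_surf = pred & ~binary_erosion(pred)
    truth_surf = truth & ~binary_erosion(truth)
    # Distance (in mm, via voxel spacing) from every voxel to the
    # nearest surface voxel of the *other* contour.
    dist_to_truth = distance_transform_edt(~truth_surf, sampling=spacing)
    dist_to_pred = distance_transform_edt(~pred_surf, sampling=spacing)
    close = ((dist_to_truth[pred_surf] <= tolerance_mm).sum()
             + (dist_to_pred[truth_surf] <= tolerance_mm).sum())
    total = pred_surf.sum() + truth_surf.sum()
    return close / total if total else 1.0

Volumetric Dice rewards overlapping volume regardless of boundary placement, whereas surface Dice at a tolerance credits only boundary agreement; this is why the two metrics can diverge on the same case, as in the external-validation results reported below.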

FINDINGS

We included 2208 patients imaged between 2001 and 2015, with 787 patients used for model discovery and 1421 for model validation, including 28 patients for end-user testing. Models showed an improvement over the interobserver benchmark (multi-delineation dataset; VD 0·91 [IQR 0·83-0·92], p=0·0062; SD 0·86 [0·71-0·91], p=0·0005), and were within the intraobserver benchmark. For primary validation, AI performance on internal Harvard-RT1 data (segmented by the same expert who segmented the discovery data) was VD 0·83 (IQR 0·76-0·88) and SD 0·79 (0·68-0·88), within the interobserver benchmark. Performance on internal Harvard-RT2 data segmented by other experts was VD 0·70 (0·56-0·80) and SD 0·50 (0·34-0·71). Performance on RTOG-0617 clinical trial data was VD 0·71 (0·60-0·81) and SD 0·47 (0·35-0·59), with similar results on diagnostic radiology datasets NSCLC-radiogenomics and Lung-PET-CT-Dx. Despite these geometric overlap results, models yielded target volumes with equivalent radiation dose coverage to those of experts. We also found non-significant differences between de novo expert and AI-assisted segmentations. AI assistance led to a 65% reduction in segmentation time (5·4 min; p<0·0001) and a 32% reduction in interobserver variability (SD; p=0·013).

INTERPRETATION

We present a clinical validation strategy for AI models. We found that in silico geometric segmentation metrics might not correlate with clinical utility of the models. Experts' segmentation style and preference might affect model performance.

FUNDING

US National Institutes of Health and EU European Research Council.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b239/9435511/8ad54dbc648f/nihms-1832205-f0001.jpg
