From the Division of Nuclear Medicine and Molecular Imaging, Geneva University Hospital, Geneva, Switzerland.
Clin Nucl Med. 2024 Dec 1;49(12):1079-1090. doi: 10.1097/RLU.0000000000005526. Epub 2024 Oct 21.
We propose a fully automated framework to conduct a region-wise image quality assessment (IQA) on whole-body 18F-FDG PET scans. This framework (1) can be valuable in daily clinical image acquisition procedures to instantly recognize low-quality scans for potential rescanning and/or image reconstruction, and (2) can make a significant impact in dataset collection for the development of artificial intelligence-driven 18F-FDG PET analysis models by rejecting low-quality images and those presenting with artifacts, toward building clean datasets.
Two experienced nuclear medicine physicians separately evaluated the quality of 174 18F-FDG PET images from 87 patients, for each body region, based on a 5-point Likert scale. The body regions were the following: (1) the head and neck, including the brain, (2) the chest, (3) the chest-abdomen interval (diaphragmatic region), (4) the abdomen, and (5) the pelvis. Intrareader and interreader reproducibility of the quality scores were calculated using 39 randomly selected scans from the dataset. For binary classification, images were dichotomized as low quality (physician quality score ≤3) or high quality (score >3). Taking the 18F-FDG PET/CT scans as input, our proposed fully automated framework applies 2 deep learning (DL) models to the CT images to perform region identification and whole-body contour extraction (excluding extremities), then classifies each PET region as low or high quality. For classification, 2 mainstream artificial intelligence-driven approaches were investigated: machine learning (ML) from radiomic features, and DL. All models were trained and evaluated on scores attributed by each physician, and the average of the scores reported. The radiomics-ML and DL models were evaluated on the same test dataset using the area under the curve, accuracy, sensitivity, and specificity, and were compared using the DeLong test, with P values <0.05 regarded as statistically significant.
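The dichotomization and the four reported metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 0.5 probability cutoff, and the choice of low quality as the positive class are assumptions made only for illustration, and the example scores and probabilities are invented.

```python
def binarize(likert_scores, threshold=3):
    """Map 5-point Likert scores to labels: 1 = low quality (score <= threshold),
    0 = high quality (score > threshold). Treating low quality as the positive
    class is an assumption of this sketch."""
    return [1 if s <= threshold else 0 for s in likert_scores]


def auc(labels, probs):
    """Area under the ROC curve via pairwise comparison of positive and
    negative cases (Mann-Whitney formulation); ties count as 0.5."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_metrics(labels, probs, cutoff=0.5):
    """Accuracy, sensitivity, and specificity at a probability cutoff.
    The cutoff of 0.5 is a hypothetical default, not taken from the study.
    Assumes both classes are present in `labels`."""
    preds = [1 if p >= cutoff else 0 for p in probs]
    tp = sum(1 for y, q in zip(labels, preds) if y == 1 and q == 1)
    tn = sum(1 for y, q in zip(labels, preds) if y == 0 and q == 0)
    fp = sum(1 for y, q in zip(labels, preds) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(labels, preds) if y == 1 and q == 0)
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Invented example: physician scores for one body region across six scans,
# and a model's predicted probability that each scan is low quality.
scores = [1, 3, 4, 5, 2, 4]
probs = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]
labels = binarize(scores)
region_auc = auc(labels, probs)
region_metrics = confusion_metrics(labels, probs)
```

In a region-wise setup such as the one described, these functions would be applied separately to each of the five body regions.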
In the head and neck, chest, chest-abdomen interval, abdomen, and pelvis regions, the best models achieved area under the curve, accuracy, sensitivity, and specificity of [0.97, 0.95, 0.96, and 0.95], [0.85, 0.82, 0.87, and 0.76], [0.83, 0.76, 0.68, and 0.80], [0.73, 0.72, 0.64, and 0.77], and [0.72, 0.68, 0.70, and 0.67], respectively. In all regions, models performed best when developed on the quality scores with higher intrareader reproducibility. Comparison of DL and radiomics-ML models did not show any statistically significant differences, though DL models tended to perform better overall.
We developed a fully automated model, equivalent to human perception, to conduct region-wise IQA of 18F-FDG PET images. Our analysis emphasizes the necessity of developing separate models for body regions and of performing data annotation based on multiple experts' consensus in IQA studies.