Vasilev Yuriy, Vladzymyrskyy Anton, Arzamasov Kirill, Omelyanskaya Olga, Shulkin Igor, Kozikhina Darya, Goncharova Inna, Reshetnikov Roman, Chetverikov Sergey, Blokhin Ivan, Bobrovskaya Tatiana, Andreychenko Anna
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Healthcare Department, Moscow, Russia.
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Healthcare Department, Moscow, Russia; I.M. Sechenov First Moscow State Medical University of the Ministry of Health of the Russian Federation (Sechenov University), 8-2 Trubetskaya str. Moscow, 119991, Russian Federation.
Int J Med Inform. 2023 Oct;178:105190. doi: 10.1016/j.ijmedinf.2023.105190. Epub 2023 Aug 9.
Replicability and generalizability of medical AI are recognized challenges that hinder broad deployment of AI in clinical practice. Detection and characterization of pulmonary nodules on chest CT images is one of the use cases in demand for AI-based automation, and multiple AI solutions addressing this task are becoming available. Here, we evaluated and compared the performance of several commercially available radiological AI models on the same clinical task, using the same external datasets acquired before and during the COVID-19 pandemic.
Five commercially available AI models for pulmonary nodule detection were tested on two external datasets labelled by experts according to the intended clinical task. Dataset1 was acquired before the pandemic and did not contain radiological signs of COVID-19; dataset2 was collected during the pandemic and did contain radiological signs of COVID-19. ROC analysis was applied to dataset1 and dataset2 separately to select a probability threshold for each dataset. AUROC, sensitivity, and specificity were used to assess and compare AI performance.
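The three metrics and the threshold selection step can be illustrated with a minimal sketch. The abstract does not state which criterion was used to pick the threshold from the ROC curve; Youden's J (sensitivity + specificity - 1) is assumed here purely for illustration, and the function names `auroc`, `sens_spec`, and `youden_threshold` are hypothetical.

```python
from typing import Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sens_spec(labels: Sequence[int], scores: Sequence[float],
              threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when score >= threshold means 'positive'."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Assumed criterion: the observed score that maximizes Youden's J.

    On ties, the lowest such threshold is kept.
    """
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        sens, spec = sens_spec(labels, scores, t)
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t


if __name__ == "__main__":
    # Toy data, not taken from the study.
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
    t = youden_threshold(labels, scores)
    print(auroc(labels, scores), t, sens_spec(labels, scores, t))
```

AUROC does not depend on a threshold, whereas sensitivity and specificity do, which is why a threshold has to be chosen for each dataset before the latter two can be compared.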
Statistically significant differences in AUROC values were observed between the AI models on dataset1, whereas on dataset2 the differences in AUROC became statistically insignificant. Sensitivity and specificity differed significantly between the AI models on dataset1. These differences were not significant on dataset2 when the probability threshold initially selected on dataset1 was applied. Updating the probability threshold based on dataset2 produced statistically significant differences in sensitivity and specificity between the AI models on dataset2. For 3 of the 5 AI models, updating the probability threshold helped compensate for the performance degradation caused by the pandemic-related population shift.
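The effect of carrying a threshold over versus updating it can be sketched on toy data. Everything below is hypothetical: the scores are invented, the helper names are illustrative, and Youden's J is an assumed selection criterion, since the abstract does not name one. In the toy shifted dataset, negative cases receive higher scores, mimicking a population in which additional findings push the model's output upward.

```python
from typing import Dict, Sequence, Tuple

Dataset = Tuple[Sequence[int], Sequence[float]]  # (labels, scores)


def sens_spec(labels: Sequence[int], scores: Sequence[float],
              threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when score >= threshold means 'positive'."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def select_threshold(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Assumed criterion: observed score maximizing sensitivity + specificity - 1."""
    candidates = sorted(set(scores))
    # max() returns the first maximal element, so ties keep the lowest threshold.
    return max(candidates, key=lambda t: sum(sens_spec(labels, scores, t)) - 1.0)


def compare_thresholds(reference: Dataset, shifted: Dataset) -> Dict[str, tuple]:
    """Evaluate the shifted dataset with the carried-over and the updated threshold."""
    t_ref = select_threshold(*reference)
    t_new = select_threshold(*shifted)
    return {
        "carried_over": (t_ref, sens_spec(shifted[0], shifted[1], t_ref)),
        "updated": (t_new, sens_spec(shifted[0], shifted[1], t_new)),
    }


if __name__ == "__main__":
    # Toy scores only; not data from the study.
    pre_shift = ([0, 0, 0, 1, 1, 1], [0.1, 0.2, 0.3, 0.6, 0.7, 0.8])
    post_shift = ([0, 0, 0, 1, 1, 1], [0.5, 0.65, 0.7, 0.75, 0.85, 0.9])
    for name, (t, (sens, spec)) in compare_thresholds(pre_shift, post_shift).items():
        print(f"{name}: threshold={t}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy case the carried-over threshold keeps sensitivity but loses specificity on the shifted data, and re-selecting the threshold on the shifted data restores it. Whether a real model behaves this way depends on how its score distribution moves, which is consistent with the update helping only some of the models in the study.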
Population shift in the data can diminish the differences in performance between AI models. Updating the probability threshold when the population shifts appears to be valuable for preserving AI model performance without retraining.