Lung Cancer Unit, IRCCS Ospedale Policlinico San Martino, Genova, Italy.
Dipartimento di Scienze Mediche, Chirurgiche e Sperimentali, Università degli Studi di Sassari, Sassari, Italy.
Cancer Res. 2021 Feb 1;81(3):724-731. doi: 10.1158/0008-5472.CAN-20-0999. Epub 2020 Nov 4.
Radiomics is defined as the use of automated or semi-automated post-processing and analysis of multiple features derived from imaging exams. Extracted features might generate models able to predict the molecular profile of solid tumors. The aim of this study was to develop a predictive algorithm to define the mutational status of EGFR in treatment-naïve patients with advanced non-small cell lung cancer (NSCLC). CT scans from 109 treatment-naïve patients with NSCLC (21 EGFR-mutant and 88 EGFR wild-type) underwent radiomics analysis to develop a machine learning model able to distinguish EGFR-mutant from EGFR wild-type (WT) patients. A "test-retest" approach was used to identify stable radiomics features. The accuracy of the model was tested on an external validation set from another institution and on a dataset from The Cancer Imaging Archive (TCIA). The machine learning model that considered both radiomic and clinical features (gender and smoking status) reached a diagnostic accuracy of 88.1% in our dataset, with an area under the ROC curve (AUC) of 0.85, whereas the accuracy values in the datasets from TCIA and the external institution were 76.6% and 83.3%, respectively. Furthermore, 17 distinct radiomics features detected at baseline CT scan were associated with subsequent development of T790M during treatment with an EGFR inhibitor. In conclusion, our machine learning model was able to identify EGFR-mutant patients in multiple validation sets with good overall accuracy, especially after data optimization. More comprehensive training sets might result in further improvement of radiomics-based algorithms. SIGNIFICANCE: These findings demonstrate that data normalization and "test-retest" methods might improve the performance of machine learning models on radiomics images and increase their reliability when used on external validation datasets.
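The abstract does not state how "test-retest" stability was measured, so the following is only an illustrative sketch of one common way to do it: compute Lin's concordance correlation coefficient (CCC) between the values of each feature extracted from two repeated scans of the same patients, and keep the features whose CCC exceeds a cutoff. The function names, the dictionary layout, and the 0.9 threshold are all assumptions made for this example and are not taken from the study.

```python
from statistics import fmean


def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient for two paired samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally sized samples with at least 2 values")
    mx, my = fmean(x), fmean(y)
    # Population (1/n) variances and covariance, as in Lin's definition.
    vx = fmean([(a - mx) ** 2 for a in x])
    vy = fmean([(b - my) ** 2 for b in y])
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    denom = vx + vy + (mx - my) ** 2
    # denom == 0 only when both samples are constant and equal.
    return 2 * cov / denom if denom else 1.0


def stable_features(test, retest, threshold=0.9):
    """Return the names of features whose test-retest CCC is >= threshold.

    `test` and `retest` map a feature name to a list of values, one per
    patient, in the same patient order. The threshold is a hypothetical
    choice for illustration, not the study's value.
    """
    return sorted(
        name
        for name in test
        if name in retest and concordance_ccc(test[name], retest[name]) >= threshold
    )
```

Called with two such dictionaries, `stable_features` returns only the feature names that are reproducible between the two acquisitions; the retained features would then be passed on to model training.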