Filos Dimitrios, Xinou Ekaterini, Chouvarda Ioanna
Laboratory of Computing, Medical Informatics and Biomedical Imaging Technologies, School of Medicine, Aristotle University of Thessaloniki, Greece.
Laboratory of Computing, Medical Informatics and Biomedical Imaging Technologies, School of Medicine, Aristotle University of Thessaloniki, Greece; Theageneio Cancer Hospital of Thessaloniki, Greece.
Comput Methods Programs Biomed. 2025 Aug 20;271:109029. doi: 10.1016/j.cmpb.2025.109029.
Lung cancer is the leading cause of cancer-related mortality globally. Early identification of patients at high risk of local or distant metastasis is challenging, yet it is needed for better monitoring and treatment planning. Machine learning models have been proposed for diagnosis and prediction of metastasis risk. However, limited data, due to long follow-up periods, and the lack of multicenter cohorts impair model generalization, because such models overlook the impact of data heterogeneity.
The aim of this study is to develop a generalizable radiomics-based model capable of predicting the occurrence of metastasis within two years of the initial lung cancer diagnosis by combining supervised techniques with semi-supervised and privileged learning approaches.
Data from 114 lung cancer patients without metastasis at diagnosis were used for model training. Computed tomography (CT) images acquired on scanners from different vendors, tumor segmentations, and clinical data were available for each patient. Radiomic features were extracted from the primary tumor as well as from an area reflecting the tumor rim, covering both peri- and extra-tumoral regions of diverse widths. Feature harmonization was applied across the features from different CT scanners. A support vector machine (SVM) model with hyperparameter tuning was trained on the most important radiomic and clinical features, which were selected by iterative recursive feature elimination. Semi-supervised learning was used to enlarge the training dataset with cases whose metastasis labels were missing. Finally, privileged learning was applied to train the SVM using additional clinical data that may not be available in the testing dataset. All models were evaluated using 21 patients from two clinical sites. Human-derived predictions were compared with the AI-based ones.
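To illustrate the rim definition above, a tumor rim can be sketched as the difference between a dilated and an eroded copy of the tumor mask. The following is a minimal 2D sketch in plain Python, not the authors' implementation: the function names, the disk-shaped neighborhood, and the 1 mm isotropic pixel size are assumptions made for this example, and a real pipeline would work on 3D CT volumes using the actual voxel spacing.

```python
from itertools import product


def disk_offsets(radius):
    """Integer (dy, dx) offsets within a Euclidean distance of `radius` pixels."""
    r = int(radius)
    return [(dy, dx) for dy, dx in product(range(-r, r + 1), repeat=2)
            if dy * dy + dx * dx <= radius * radius]


def dilate(mask, radius):
    """Binary dilation: a pixel is set if any pixel within `radius` is set."""
    h, w = len(mask), len(mask[0])
    offsets = disk_offsets(radius)
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for dy, dx in offsets:
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        out[yy][xx] = True
    return out


def erode(mask, radius):
    """Binary erosion: a pixel survives only if all pixels within `radius` are set.

    Pixels outside the image are treated as background.
    """
    h, w = len(mask), len(mask[0])
    offsets = disk_offsets(radius)
    return [[all(0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                 for dy, dx in offsets)
             for x in range(w)]
            for y in range(h)]


def rim_mask(mask, inner_mm, outer_mm, pixel_mm=1.0):
    """Rim = (mask dilated by outer_mm) minus (mask eroded by inner_mm).

    `pixel_mm` is an assumed isotropic pixel size used to convert mm to pixels.
    """
    inner = erode(mask, inner_mm / pixel_mm)
    outer = dilate(mask, outer_mm / pixel_mm)
    h, w = len(mask), len(mask[0])
    return [[outer[y][x] and not inner[y][x] for x in range(w)]
            for y in range(h)]


# Hypothetical example: an 11 x 11 pixel square "tumor" in a 21 x 21 image.
tumor = [[5 <= y <= 15 and 5 <= x <= 15 for x in range(21)] for y in range(21)]
rim = rim_mask(tumor, inner_mm=3, outer_mm=3)
print(sum(map(sum, rim)), "rim pixels")
```

With `inner_mm=3` and `outer_mm=3` the band extends 3 mm on each side of the tumor edge; varying the two widths independently is one way to explore rims of diverse widths.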
The width of the tumor rim that maximized the balanced accuracy (BA) during the training of the SVM model extended 3 mm inside and 3 mm outside the tumor edge. The selected radiomic features had higher values in patients with metastasis, suggesting a higher heterogeneity of the analysis areas. The SVM model had a BA of 76.7 % with high specificity and low sensitivity (86.7 % and 66.7 %, respectively) in the external dataset. The application of semi-supervised learning improved performance, while the inclusion of privileged learning techniques led to a BA of 81.65 % improving both specificity and sensitivity (80 % and 83.3 %, respectively). On a small dataset from one clinical site, this model outperformed the human expert in terms of BA.
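For reference, balanced accuracy is the mean of sensitivity and specificity, so the BA values reported above can be checked directly from the reported rates. The short sketch below only illustrates that arithmetic; the function name is an assumption for the example and is not taken from the study.

```python
def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy: the mean of sensitivity and specificity (same units)."""
    return (sensitivity + specificity) / 2.0


# Rates (in percent) as reported for the external dataset.
ba_svm = balanced_accuracy(sensitivity=66.7, specificity=86.7)
ba_privileged = balanced_accuracy(sensitivity=83.3, specificity=80.0)

print(f"SVM: {ba_svm:.2f} %")                 # approximately 76.7
print(f"Privileged: {ba_privileged:.2f} %")   # approximately 81.65
```

Because BA weights both classes equally, it is less sensitive to class imbalance than plain accuracy, which matters in a small external set.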
The rim size analysis provided insights into tumor aggressiveness and metastasis occurrence, especially when combined with the patient's stage and tumor characteristics. Integrating semi-supervised and privileged learning techniques significantly enhanced the performance of the SVM models in identifying lung cancer patients at risk for metastasis, outperforming human-derived predictions in a small external dataset. The model's generalizability was demonstrated on an external validation set from two different regions. More extensive validation is necessary to confirm these findings.