Chen Chun-Chia, Ting Wen-Chien, Lee Hsi-Chieh, Chang Chi-Chang, Lin Tsung-Chieh, Yang Shun-Fa
Institute of Medicine, Chung Shan Medical University, Taichung 40201, Taiwan.
Division of Plastic Surgery, Department of Surgery, Chi Mei Medical Center, Tainan 704, Taiwan.
Diagnostics (Basel). 2024 Apr 18;14(8):842. doi: 10.3390/diagnostics14080842.
This study used artificial intelligence techniques to identify clinical biomarkers of recurrence in gastric cancer survivors. Data on recurrence incidence and clinical risk features for 2476 gastric cancer survivors were drawn from a hospital-based cancer registry database in Taiwan. We benchmarked Random Forest against multilayer perceptron (MLP), C4.5, AdaBoost, and Bagging algorithms, using the synthetic minority oversampling technique (SMOTE) to address class imbalance, cost-sensitive learning for risk assessment, and SHapley Additive exPlanations (SHAP) for feature importance analysis. The proposed Random Forest outperformed the other models, with an accuracy of 87.9%, a recall of 90.5%, a precision of 86%, and an F1 score of 88.2% on the recurrent category under 10-fold cross-validation on a balanced dataset. The top five clinical features of recurrent gastric cancer included stage, number of regional lymph nodes involved, body mass index (BMI), and gender; these features significantly affect the prediction model's output and merit attention in subsequent causal effect analysis. Using an artificial intelligence model, the risk factors for recurrent gastric cancer could be identified and cost-effectively ranked by feature importance. These features may also help physicians screen for high-risk patients among gastric cancer survivors.
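As a quick consistency check on the reported metrics, the F1 score is the harmonic mean of precision and recall. The sketch below is illustrative only and is not the authors' code; it assumes the 86% figure is the precision on the recurrent category, and the function and variable names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Values taken from the abstract for the recurrent category.
# Assumption: the 86% figure is precision, not a second accuracy.
reported_precision = 0.86
reported_recall = 0.905

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.882"
```

Under that assumption the result matches the reported F1 of 88.2%, which supports reading the 86% figure as precision.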