Datta Ankur, C George Priya Doss
Laboratory of Integrative Genomics, Department of Integrative Biology, School of BioSciences and Technology, Vellore Institute of Technology, Vellore, Tamil Nadu 632014, India.
Comput Biol Chem. 2025 Apr;115:108333. doi: 10.1016/j.compbiolchem.2024.108333. Epub 2024 Dec 27.
Patients with non-small cell lung cancer (NSCLC) present with a variety of clinical symptoms, such as dyspnea and chest pain, which complicates accurate diagnosis. NSCLC includes subtypes distinguished by histological characteristics, specifically lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). This study aims to identify and compare abnormal gene expression patterns in LUAD and LUSC samples relative to adjacent healthy tissues using an explainable artificial intelligence (XAI) framework. The LASSO algorithm was employed to identify the top gene features in the LUAD and LUSC datasets. An ensemble-based extreme gradient boosting (XGBoost) machine learning (ML) algorithm was trained and interpreted using SHapley Additive exPlanations (SHAP), and the top features underwent biological annotation through survival and functional enrichment analyses. The XAI-based SHAP module addresses the opaque nature of ML models. The LASSO algorithm selected 35 genes for LUAD and 33 genes for LUSC. Performance metrics such as average accuracy and the Matthews correlation coefficient were evaluated. The XGBoost model achieved an average accuracy of 99.1% for LUAD and 98.6% for LUSC. The SFTPC gene emerged as the most significant feature in both NSCLC subtypes. For LUAD, genes such as STX11, CLEC3B, EMP2, and LYVE1 had a significant influence on model predictions within the XAI-SHAP framework. For LUSC, GKN2, OGN, SLC39A8, and MMRN1 were identified instead. Survival analysis and functional validation of these genes highlighted the physiological functions observed to be dysregulated in the NSCLC subtypes. These identified genes have the potential to enhance current medical diagnostics and therapeutics.
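The two performance metrics named in the abstract, accuracy and the Matthews correlation coefficient (MCC), can be illustrated with a minimal sketch. It assumes a binary tumor-versus-adjacent-normal labeling (1 = tumor, 0 = normal). The function names and the toy labels are hypothetical, chosen only for illustration; they are not the study's code or data.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = tumor, 0 = adjacent normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels (hypothetical, not from the study): 6 tumor and 4 normal samples.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"accuracy={acc:.3f}, MCC={mcc:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when tumor samples greatly outnumber adjacent normal samples. That is one plausible reason to report both metrics together.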