Sobhan Masrur, Islam Md Mezbahul, Mondal Ananda Mohan
Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, USA.
bioRxiv. 2025 Jan 13:2025.01.09.632292. doi: 10.1101/2025.01.09.632292.
Lung cancer is a leading cause of cancer-related mortality, with disparities in incidence and outcomes observed across racial and sex groups. Understanding the genetic factors behind these disparities is critical for developing targeted therapies. This study aims to identify both patient-specific and cohort-specific biomarker genes that contribute to lung cancer health disparities among African American males (AAMs), European American males (EAMs), African American females (AAFs), and European American females (EAFs). Real-world data are highly imbalanced with respect to race, and lung cancer datasets are no exception, so classification with race labels would produce results heavily biased toward the larger cohort. We therefore developed a computational framework that defines the classification problems by disease condition instead of race and leverages the local interpretability of SHAP (SHapley Additive exPlanations), an explainable AI method. The study used three sample conditions, Lung Adenocarcinoma (LUAD), Lung Squamous Cell Carcinoma (LUSC), and healthy samples (HEALTHY), to design four classification tasks: one 3-class problem (LUAD-LUSC-HEALTHY) and three 2-class problems (LUAD-LUSC, LUAD-HEALTHY, and LUSC-HEALTHY). This multiple-classification approach allows a LUAD patient to be interrogated through three classification problems (LUAD-LUSC-HEALTHY, LUAD-LUSC, and LUAD-HEALTHY), providing a robust way to retrieve disparity information for individual patients through SHAP's local interpretation. The proposed method identified sets of genes and pathways related to lung cancer health disparities in four pairwise cohort comparisons: AAMs vs. EAMs, AAFs vs. EAFs, AAMs vs. AAFs, and EAMs vs. EAFs. The resulting genes and pathways give biologists a short list of candidates for wet-lab experiments.
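The task design described in the abstract can be illustrated with a minimal sketch. It only enumerates the four classification tasks from the three condition labels and lists the tasks in which a sample with a given label takes part. The function names (`build_tasks`, `tasks_for`, `task_name`) are hypothetical and chosen for illustration. The sketch does not train classifiers or compute SHAP values, and it is not the authors' implementation.

```python
from itertools import combinations

# Condition labels as named in the abstract.
CONDITIONS = ("LUAD", "LUSC", "HEALTHY")


def build_tasks(conditions=CONDITIONS):
    """Return the all-class task followed by every pairwise task."""
    tasks = [tuple(conditions)]                # one 3-class problem
    tasks.extend(combinations(conditions, 2))  # three 2-class problems
    return tasks


def tasks_for(condition, tasks):
    """Return the tasks whose label set contains the given condition."""
    return [task for task in tasks if condition in task]


def task_name(task):
    """Format a task as a hyphen-joined name, e.g. LUAD-LUSC."""
    return "-".join(task)


tasks = build_tasks()
luad_tasks = [task_name(t) for t in tasks_for("LUAD", tasks)]

print([task_name(t) for t in tasks])
print(luad_tasks)
```

Each condition label appears in three of the four tasks, so every sample can receive up to three sets of local explanations, one per task it takes part in.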