Computer Science Department, Faculty of Computing and Information Technology, King Abdulaziz University, 80200, Jeddah, Saudi Arabia.
Center of Excellence in Genomic Medicine Research, King Abdulaziz University, 21589, Jeddah, Saudi Arabia.
Sci Rep. 2023 Nov 30;13(1):21114. doi: 10.1038/s41598-023-47805-2.
Circulating tumor cells (CTCs) are cancer cells that detach from the primary tumor and intravasate into the bloodstream. Non-invasive liquid biopsies are therefore being used to analyze CTC-expressed genes and identify potential cancer biomarkers. In this regard, several studies have used gene expression changes in blood to predict the presence of CTCs and, consequently, cancer. However, CTC mRNA data have not yet been used to develop a generic approach that indicates the presence of multiple cancer types. In this study, we developed such an approach. Briefly, we designed two computational workflows: one applies deep learning (DL) to the raw mRNA data, and the other combines five hub gene ranking algorithms (Degree, Maximum Neighborhood Component, Betweenness Centrality, Closeness Centrality, and Stress Centrality) with machine learning (ML). Both workflows aim to determine the top genes that best distinguish cancer types based on the CTC mRNA data. We demonstrate that our automated, robust DL framework (DNNraw) indicates the presence of multiple cancer types from CTC gene expression data more accurately than multiple ML approaches. The DL approach achieved an average precision of 0.9652, recall of 0.9640, F1-score of 0.9638, and overall accuracy of 0.9640. Furthermore, since we designed multiple approaches, we also provide a bioinformatics analysis of the gene commonly identified as top-ranked by the different methods. To our knowledge, this is the first study in which a generic approach has been developed to predict the presence of multiple cancer types using raw CTC mRNA data, in contrast to other models, which require a feature selection step.
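As an illustration of how averaged multi-class metrics of this kind can be computed, the following is a minimal sketch in plain Python. The abstract does not state which averaging scheme was used; support-weighted averaging is assumed here because, under that scheme, average recall always equals overall accuracy, which would be consistent with the identical reported values of 0.9640. The function name `weighted_metrics`, the class labels, and the toy predictions are hypothetical and are not taken from the study.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall, F1 and overall accuracy.

    Assumption: each class is weighted by its share of the true labels.
    This is an illustrative choice, not the study's documented procedure.
    """
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    precision = recall = f1 = 0.0

    for label in sorted(support):
        # True positives: samples of this class that were predicted as this class.
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)

        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support[label]
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0

        weight = support[label] / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical toy labels for three cancer types (not data from the study).
y_true = ["breast", "breast", "breast", "lung", "lung", "prostate"]
y_pred = ["breast", "breast", "lung", "lung", "lung", "prostate"]

metrics = weighted_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With support weighting, each class's recall is multiplied by its share of the samples, so the weighted sum reduces to the total number of correct predictions divided by the number of samples, which is the accuracy. Macro averaging (equal weight per class) would not generally give this equality.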