Samsung Genome Institute, Samsung Medical Center, Seoul, Korea.
Department of Health Science and Technology, Samsung Advanced Institute of Health Science and Technology, Sungkyunkwan University, Seoul, Korea.
PLoS One. 2019 Jul 16;14(7):e0219682. doi: 10.1371/journal.pone.0219682. eCollection 2019.
Intratumoral heterogeneity (ITH) refers to the presence of distinct tumor cell populations within a tumor. It provides vital information for the clinical prognosis, drug responsiveness, and personalized treatment of cancer patients. Because genomic ITH affects gene expression patterns in various cancers, the expression profile could be used to determine the level of ITH. Herein, we present a novel machine learning approach to detect high ITH, defined as a larger number of subclones, directly from the gene expression pattern. We examined associations between gene expression profile and ITH in 12 cancer types from The Cancer Genome Atlas (TCGA) database. Using stomach adenocarcinoma (STAD), which showed a high association, we evaluated the performance of our method in predicting ITH by applying three machine learning algorithms to gene expression profile data. We classified tumors into high- and low-heterogeneity groups using models trained on features selected by LASSO. The results showed that support vector machines (SVMs) outperformed the other algorithms (AUC = 0.84 for SVMs and 0.82 for Naïve Bayes), and that predictive power improved when mutation and expression data were combined. Furthermore, we evaluated the predictive ability of each model using simulation data generated by mixing cell lines from the Cancer Cell Line Encyclopedia (CCLE), and obtained results consistent with those from the real dataset. Our approach could be used to discriminate tumors with heterogeneous cell populations and thereby characterize ITH.
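The simulation and evaluation steps mentioned above can be illustrated with a minimal sketch. The abstract does not describe how cell lines were mixed or how the AUC was computed, so everything here is an assumption made for illustration: a mixed sample is taken to be a proportion-weighted average of per-gene expression values, the number of mixed cell lines stands in for the number of subclones, and the AUC is computed with the standard rank-based definition. The names `mix_profiles`, `simulate_sample`, `roc_auc`, and the toy cell lines are hypothetical and are not taken from the paper or from CCLE.

```python
import random


def mix_profiles(profiles, proportions):
    """Return the proportion-weighted average of expression profiles.

    profiles: list of equal-length lists (one expression vector per cell line).
    proportions: non-negative mixing weights; they are normalized to sum to 1.
    """
    if len(profiles) != len(proportions):
        raise ValueError("need one proportion per profile")
    total = sum(proportions)
    if total <= 0:
        raise ValueError("proportions must sum to a positive value")
    weights = [p / total for p in proportions]
    n_genes = len(profiles[0])
    return [
        sum(w * profile[g] for w, profile in zip(weights, profiles))
        for g in range(n_genes)
    ]


def simulate_sample(cell_lines, n_subclones, rng):
    """Mix n_subclones randomly chosen cell lines at random proportions.

    cell_lines: dict mapping a cell-line name to its expression vector.
    Returns the chosen names and the mixed expression vector.
    """
    chosen = rng.sample(sorted(cell_lines), n_subclones)
    # Assumption: each component contributes at least a small fraction.
    proportions = [rng.random() + 0.05 for _ in chosen]
    mixed = mix_profiles([cell_lines[name] for name in chosen], proportions)
    return chosen, mixed


def roc_auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data only: three made-up "cell lines" with four genes each.
toy_lines = {
    "line_A": [5.0, 1.0, 0.0, 2.0],
    "line_B": [0.0, 4.0, 3.0, 2.0],
    "line_C": [1.0, 1.0, 6.0, 0.0],
}
rng = random.Random(0)
names, mixed = simulate_sample(toy_lines, n_subclones=2, rng=rng)
print(names, [round(x, 3) for x in mixed])
```

In such a setup, samples mixed from more cell lines would be labeled high ITH and those from fewer would be labeled low ITH, and `roc_auc` would then score any classifier's output against those labels.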