Integrated Cancer Research Center, School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA.
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
BMC Bioinformatics. 2020 Sep 30;21(Suppl 14):364. doi: 10.1186/s12859-020-03690-4.
Machine learning has been used to predict cancer drug response from multi-omics data derived from the sensitivities of cancer cell lines to different therapeutic compounds. Here, we build machine learning models using gene expression data from patients' primary tumor tissues to predict whether a patient will respond positively or negatively to two chemotherapeutics: 5-Fluorouracil and Gemcitabine.
We focused on 5-Fluorouracil and Gemcitabine because, after applying our exclusion criteria, they have the largest numbers of patients within TCGA. Normalized gene expression data were clustered and used as the input features for the study. We used matching clinical trial data to ascertain the response of these patients, and predicted that response with multiple classification methods. Multiple clustering and classification methods were compared for accuracy in predicting drug response; Clara and random forest were found to be the best clustering and classification methods, respectively. Our models predict with up to 86% accuracy, despite the study's limited sample size. We also found that the genes most informative for predicting drug response were enriched in well-known cancer signaling pathways, highlighting their potential significance in chemotherapy prognosis.
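The abstract names Clara as the best-performing clustering method but gives no implementation details. As a rough illustration of the general CLARA idea (fit k-medoids on several random subsamples, then keep the medoid set with the lowest total dissimilarity over the full dataset), a minimal sketch might look like the following. It is not the authors' code: the function names (`k_medoids`, `clara`), the default sample size, and the synthetic two-group data are all assumptions, and the inner step is a simplified alternating k-medoids rather than the full PAM swap search.

```python
import math
import random


def total_cost(points, medoids):
    """Sum of each point's Euclidean distance to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def k_medoids(points, k, rng, max_iter=50):
    """Simplified alternating k-medoids (assumption: stands in for PAM)."""
    medoids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, medoids[i]))
            clusters[j].append(p)
        # Update step: the new medoid is the member that minimizes the
        # summed distance to the other members of its cluster.
        new_medoids = []
        for j, members in enumerate(clusters):
            if not members:
                new_medoids.append(medoids[j])  # keep an empty cluster's medoid
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(math.dist(c, q) for q in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


def clara(points, k, n_samples=5, sample_size=None, seed=0):
    """CLARA-style clustering: k-medoids on subsamples, scored on all points."""
    rng = random.Random(seed)
    if sample_size is None:
        # Illustrative default, capped at the dataset size.
        sample_size = min(len(points), 40 + 2 * k)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, sample_size)
        medoids = k_medoids(sample, k, rng)
        cost = total_cost(points, medoids)  # evaluated on the FULL dataset
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    labels = [
        min(range(k), key=lambda i: math.dist(p, best_medoids[i])) for p in points
    ]
    return best_medoids, labels, best_cost


# Synthetic stand-in for normalized expression profiles (not real data):
# two well-separated groups of 60 three-dimensional points each.
data_rng = random.Random(1)
points = [tuple(data_rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(60)]
points += [tuple(data_rng.gauss(10.0, 1.0) for _ in range(3)) for _ in range(60)]

medoids, labels, cost = clara(points, k=2)
```

Because each k-medoids run sees only a subsample, the cost of fitting stays small as the dataset grows, while scoring every candidate medoid set on all points keeps the chosen solution representative of the whole dataset. In a pipeline like the one described, the resulting cluster assignments would then be summarized into input features for a classifier such as a random forest.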
Primary tumor gene expression is a good predictor of cancer drug response. Investment in larger datasets containing both patient gene expression and drug response is needed to support future work on machine learning models. Ultimately, such predictive models may aid oncologists in making critical treatment decisions.