Department of Electrical and Computer Engineering, University of Miami, Miami, FL 33146, United States.
Department of Otolaryngology, University of Miami, Miami, FL 33146, United States.
Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae544.
Recent advances in image classification have shown that contrastive learning (CL) can aid downstream learning tasks by acquiring good feature representations from a limited number of data samples. In this paper, we applied CL to tumor transcriptomes and clinical data to learn feature representations in a low-dimensional space. We then used these learned features to train a classifier that assigns tumors to a high- or low-risk group for recurrence. Using data from The Cancer Genome Atlas (TCGA), we demonstrated that CL can significantly improve classification accuracy. Specifically, our CL-based classifiers achieved an area under the receiver operating characteristic curve (AUC) greater than 0.8 for 14 types of cancer, and an AUC greater than 0.9 for 3 types of cancer. We also developed CL-based Cox (CLCox) models for predicting cancer prognosis. Our CLCox models trained with the TCGA data significantly outperformed existing methods in predicting prognosis for the 19 types of cancer under consideration. The performance of the CLCox models and CL-based classifiers trained with TCGA lung and prostate cancer data was validated using data from two independent cohorts. We also show that the CLCox model trained with the whole transcriptome significantly outperforms the Cox model trained with the 16 genes of Oncotype DX, which is in clinical use for breast cancer patients. The trained models and the Python code are publicly accessible and provide a valuable resource that may find clinical applications for many types of cancer.
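To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch of a contrastive objective and a Cox objective. It is not the authors' released code. The abstract does not state which contrastive loss was used, so the sketch assumes a supervised contrastive loss with risk-group labels. The function names, the temperature value, and the simplified handling of tied event times are all illustrative assumptions.

```python
import numpy as np


def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Assumed SupCon-style loss: embeddings that share a label are pulled
    together, embeddings with different labels are pushed apart."""
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize each row
    sim = z @ z.T / temperature                        # scaled cosine similarity
    not_self = ~np.eye(len(labels), dtype=bool)        # exclude i == j pairs
    sim = sim - sim.max(axis=1, keepdims=True)         # numerical stability
    exp_sim = np.exp(sim) * not_self
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0                                # skip anchors with no positive
    per_anchor = -(log_prob * positives).sum(axis=1)[has_pos] / n_pos[has_pos]
    return per_anchor.mean()


def cox_neg_log_partial_likelihood(risk, time, event):
    """Negative Cox log partial likelihood, averaged over observed events.
    `risk` is the model's log-hazard score; ties in `time` are not treated
    specially (a simplification made for this sketch)."""
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    order = np.argsort(-time)                          # longest follow-up first
    risk, event = risk[order], event[order]
    # Running log-sum-exp gives log of the summed hazards over each risk set.
    log_risk_set = np.logaddexp.accumulate(risk)
    n_events = max(int(event.sum()), 1)
    return -np.sum((risk - log_risk_set)[event == 1]) / n_events
```

In a pipeline of the kind the abstract describes, an encoder would first be trained with the contrastive objective on transcriptome and clinical features, and its low-dimensional output would then feed either a classifier or a Cox model whose score plays the role of `risk` above.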