PES Center for Pattern Recognition, Department of Computer Science and Engineering, PES University, Bengaluru, 560085, India.
Department of Computer Science and Engineering, PES University Electronic City Campus, Bengaluru, 560100, India.
BMC Bioinformatics. 2023 Jun 7;24(1):241. doi: 10.1186/s12859-023-05347-4.
RNA sequencing (RNA-Seq) is a technique that uses next-generation sequencing to study a cellular transcriptome, i.e., to determine the amount of RNA present in a given biological sample at a given time. Advances in RNA-Seq technology have produced a large volume of gene expression data for analysis.
Our computational model, built on top of TabNet, is first pretrained on an unlabelled dataset of multiple types of adenomas and adenocarcinomas and then fine-tuned on the labelled dataset. It shows promising results in estimating the vital status of colorectal cancer patients. Using multiple modalities of data, we achieve a final cross-validated ROC-AUC score of 0.88.
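As an illustration of the reported metric, the following is a minimal sketch of how a cross-validated ROC-AUC score can be computed from per-fold predictions. It assumes the final score is the plain mean of per-fold ROC-AUC values, which the abstract does not state. The function names `roc_auc` and `cross_validated_auc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative one.
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def cross_validated_auc(
    fold_labels: Sequence[Sequence[int]],
    fold_scores: Sequence[Sequence[float]],
) -> float:
    """Mean of per-fold ROC-AUC values (assumed aggregation, see lead-in)."""
    aucs = [roc_auc(y, s) for y, s in zip(fold_labels, fold_scores)]
    return sum(aucs) / len(aucs)


# Toy held-out folds (illustrative only): 1 and 0 are the two
# vital-status classes, scores are predicted probabilities for class 1.
toy_labels = [[0, 1], [0, 0, 1, 1]]
toy_scores = [[0.2, 0.9], [0.1, 0.4, 0.35, 0.8]]
toy_cv_auc = cross_validated_auc(toy_labels, toy_scores)
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for a sketch; a rank-based implementation would be preferable for large cohorts.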
The results of this study demonstrate that self-supervised learning methods pretrained on a vast corpus of unlabelled data outperform traditional supervised learning methods such as XGBoost, neural networks, and decision trees that have been prevalent in the tabular domain. Including multiple modalities of patient data improves the results further. Model interpretability identifies genes such as RBM3, GSPT1, MAD2L1, and others as important to the computational model's prediction task, and these findings are corroborated by pathological evidence in the current literature.