Zhang Hongyu, Jiang Limin, Tang Jijun, Ding Yijie
School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China.
Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States.
Front Cell Dev Biol. 2021 Mar 5;9:615747. doi: 10.3389/fcell.2021.615747. eCollection 2021.
In recent years, cancer has become a severe threat to human health. Accurately identifying cancer subtypes would be of great significance for research on anti-cancer drugs, for the development of personalized treatments, and ultimately for conquering cancer. In this paper, we obtain three types of feature data (gene expression profiles, isoform expression, and DNA methylation) for lung cancer and renal cancer from the Broad GDAC, which collects standardized data extracted from The Cancer Genome Atlas (TCGA). Because the feature dimension is very large, Principal Component Analysis (PCA) is used to reduce the dimensionality of the feature vectors, which eliminates redundant features and speeds up the classification model. Within a multiple kernel learning (MKL) framework, we calculate the kernel fusion weights using four methods: kernel target alignment (KTA), fast kernel learning (FKL), the Hilbert-Schmidt Independence Criterion (HSIC), and simple averaging (Mean). Finally, we feed the combined kernel into a support vector machine (SVM) and obtain excellent results. In the classification of renal cell carcinoma subtypes, the maximum accuracy reaches 0.978 using MKL with HSIC-derived weights, while in the classification of lung cancer subtypes, the accuracy reaches 0.990 using MKL with FKL-derived weights.
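The kernel fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one Gaussian (RBF) kernel per data view, uses a simple heuristic in which each weight is proportional to that kernel's alignment with an ideal same-label kernel (one plain form of KTA weighting), and runs on synthetic data rather than TCGA. The PCA step is omitted, and the function names, the `gamma` setting, and the toy views are all hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix for the rows of X (assumed kernel choice)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def alignment(K, Y):
    """Kernel target alignment: <K, Y>_F / (||K||_F * ||Y||_F)."""
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

def kta_weights(kernels, labels):
    """Heuristic weights proportional to each kernel's alignment with the labels."""
    # Ideal kernel (assumption): 1 where two samples share a label, else 0.
    Y = (labels[:, None] == labels[None, :]).astype(float)
    scores = np.array([alignment(K, Y) for K in kernels])
    scores = np.clip(scores, 0.0, None)  # guard against negative alignment
    return scores / scores.sum()

def combine(kernels, weights):
    """Weighted sum of kernel matrices."""
    return sum(w * K for w, K in zip(weights, kernels))

# Synthetic stand-in for three data views with decreasing class separation.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
views = [labels[:, None] * shift + rng.normal(size=(40, 5))
         for shift in (2.0, 1.0, 0.0)]
kernels = [rbf_kernel(V, gamma=1.0 / V.shape[1]) for V in views]

w_kta = kta_weights(kernels, labels)                   # alignment-based weights
w_mean = np.full(len(kernels), 1.0 / len(kernels))     # the "Mean" baseline
K_kta = combine(kernels, w_kta)
K_mean = combine(kernels, w_mean)
```

A combined matrix such as `K_kta` could then be passed to an SVM that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`, with the weights estimated on training samples only.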