Seo Mi-Kyoung, Paik Soonmyung, Kim Sangwoo
Department of Biomedical Systems Informatics, Brain Korea 21 PLUS Project for Medical Science, Yonsei University College of Medicine, Seoul 03722, Korea.
Severance Biomedical Science Institute, Yonsei University College of Medicine, Seoul 03722, Korea.
Cancers (Basel). 2020 Nov 25;12(12):3506. doi: 10.3390/cancers12123506.
Intrinsic molecular subtypes provide an important biological classification of breast cancer (BC), but the subtype assigned to an individual sample is influenced by assay technology and by the composition of the study cohort. We sought to develop a platform-independent, absolute single-sample subtype classifier based on a minimal number of genes. Pairwise ratios of subtype-specific differentially expressed genes, computed from un-normalized expression data for 432 BC samples from The Cancer Genome Atlas (TCGA), were used as inputs for machine learning. During cross-validation, we selected the subtype classifier that used the fewest genes while retaining maximal classification power. The final model was evaluated on 5816 samples from 10 independent studies profiled on four different assay platforms. In cross-validation within the TCGA cohort, an 11-gene random forest classifier (MiniABS) achieved the best accuracy, 88.2%. On five validation sets of RNA-seq and microarray data, MiniABS showed an average accuracy of 85.15%, versus 77.72% for Absolute Intrinsic Molecular Subtype (AIMS). Only MiniABS could be applied to the five low-throughput datasets, where it showed an average accuracy of 87.93%. MiniABS can thus assign an absolute BC subtype from the raw expression levels of only 11 genes, regardless of assay platform, and with higher accuracy than existing methods.
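The pairwise-ratio inputs described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pairwise_log_ratios`, the placeholder gene names, the use of a log2 transform, and the toy expression values are all assumptions made for illustration. The sketch shows only one property that makes ratios attractive for un-normalized data: a per-sample multiplicative factor cancels out of every within-sample ratio. Real differences between assay platforms are more complex than a single scaling factor.

```python
import math
from itertools import combinations


def pairwise_log_ratios(expr):
    """Return log2(a / b) for every unordered gene pair within one sample.

    `expr` maps gene name -> raw expression value. Values are assumed to be
    strictly positive (an assumption of this sketch, not of the paper).
    """
    genes = sorted(expr)
    return {
        f"{a}/{b}": math.log2(expr[a] / expr[b])
        for a, b in combinations(genes, 2)
    }


# Hypothetical raw expression values for three placeholder genes.
raw = {"GENE_A": 8.0, "GENE_B": 2.0, "GENE_C": 4.0}

# The same sample measured with a 10x global scaling factor.
rescaled = {gene: value * 10.0 for gene, value in raw.items()}

features_raw = pairwise_log_ratios(raw)
features_rescaled = pairwise_log_ratios(rescaled)

# The ratio features agree even though the absolute values differ.
same = all(
    math.isclose(features_raw[k], features_rescaled[k]) for k in features_raw
)
```

In a full pipeline, feature vectors of this kind would be passed to a classifier such as a random forest; with 11 genes there would be 55 unordered pairs, although which pairs the published model uses is not stated in the abstract.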