Qi Lin, Wang Wei, Wu Tan, Zhu Lina, He Lingli, Wang Xin
Department of Biomedical Sciences, City University of Hong Kong, Shenzhen, China.
Key Laboratory of Biochip Technology, Biotech and Health Centre, Shenzhen Research Institute, City University of Hong Kong, Shenzhen, China.
Front Genet. 2021 Jul 22;12:607817. doi: 10.3389/fgene.2021.607817. eCollection 2021.
It is now clear that major malignancies are heterogeneous diseases with diverse molecular properties and clinical outcomes, which poses a great challenge for individualized therapy. Over the last decade, cancer molecular subtyping studies have mostly relied on transcriptomic profiles, ignoring heterogeneity at other (epi)genetic levels of gene regulation. Integrating multiple types of (epi)genomic data generates a more comprehensive landscape of biological processes, providing an opportunity to better dissect cancer heterogeneity. Here, we propose sparse canonical correlation analysis for cancer classification (SCCA-CC), which projects each type of single-omics data onto a unified space for data fusion, followed by clustering and classification analysis. Without loss of generality, as case studies, we integrated two types of omics data, mRNA and miRNA profiles, for molecular classification of ovarian cancer (n = 462) and breast cancer (n = 451). The two types of omics data were projected onto a unified space using SCCA, followed by data fusion to identify cancer subtypes. The subtypes we identified recapitulated subtypes previously recognized by other groups (all P-values < 0.001), but displayed more significant clinical associations. In ovarian cancer in particular, the four subtypes we identified were significantly associated with overall survival, while the taxonomy previously established by TCGA was not (P-values: 0.039 vs. 0.12). The multi-omics classifiers we established can not only classify individual types of data but also achieve higher accuracy on the fused data. Compared with iCluster, SCCA-CC identified subtypes of higher coherence and clinical relevance, and did so with better time efficiency. In conclusion, we developed an integrated bioinformatic framework, SCCA-CC, for cancer molecular subtyping. Using two case studies in breast and ovarian cancer, we demonstrated its effectiveness in identifying biologically meaningful and clinically relevant subtypes. SCCA-CC has a unique advantage in its ability to classify both single-omics and multi-omics data, which significantly extends its applicability to various data types and makes more efficient use of published omics resources.
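The project-fuse-cluster idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the SCCA solver, the fusion rule, or the clustering algorithm, so the sketch assumes a simplified soft-thresholded power iteration for sparse CCA, fusion by averaging the standardized canonical scores, and a small k-means. All function names, the `sparsity` parameter, and the synthetic "mRNA" and "miRNA" matrices are hypothetical.

```python
import numpy as np

def soft_threshold(a, lam):
    """Shrink entries of a toward zero by lam (the L1 proximal operator)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def zscore(M):
    """Center and scale each column; the epsilon guards constant columns."""
    return (M - M.mean(axis=0)) / (M.std(axis=0) + 1e-12)

def sparse_cca(X, Y, n_components=1, sparsity=0.3, n_iter=50):
    """Simplified sparse CCA (assumption: not the paper's exact solver).

    Alternates soft-thresholded power iterations on the cross-covariance
    matrix, then deflates it before extracting the next component.
    `sparsity` in [0, 1) sets the threshold as a fraction of the largest
    absolute entry, so at least one loading always survives.
    """
    X, Y = zscore(X), zscore(Y)
    K = X.T @ Y / (X.shape[0] - 1)
    Wx, Wy = [], []
    for _ in range(n_components):
        # Start from the leading right singular vector of K.
        v = np.linalg.svd(K, full_matrices=False)[2][0]
        for _step in range(n_iter):
            a = K @ v
            u = soft_threshold(a, sparsity * np.abs(a).max())
            u /= np.linalg.norm(u) + 1e-12
            b = K.T @ u
            v = soft_threshold(b, sparsity * np.abs(b).max())
            v /= np.linalg.norm(v) + 1e-12
        K = K - (u @ K @ v) * np.outer(u, v)  # deflation
        Wx.append(u)
        Wy.append(v)
    Wx, Wy = np.column_stack(Wx), np.column_stack(Wy)
    return X @ Wx, Y @ Wy, Wx, Wy

def fuse(Zx, Zy):
    """Assumed fusion rule: average the standardized canonical scores."""
    return (zscore(Zx) + zscore(Zy)) / 2.0

def kmeans(Z, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per sample."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        new = np.array([Z[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

# Synthetic stand-ins for two omics layers sharing one latent subtype signal.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 60)                      # two hidden subtypes
latent = (2 * group - 1) + 0.3 * rng.normal(size=120)
mrna = rng.normal(size=(120, 50))                  # hypothetical mRNA matrix
mirna = rng.normal(size=(120, 30))                 # hypothetical miRNA matrix
mrna[:, :5] += 1.5 * latent[:, None]               # informative features only
mirna[:, :4] += 1.5 * latent[:, None]

Zx, Zy, Wx, Wy = sparse_cca(mrna, mirna, n_components=1)
labels = kmeans(fuse(Zx, Zy), k=2)
print("non-zero mRNA loadings:", int(np.sum(Wx[:, 0] != 0)))
print("first canonical correlation:", np.corrcoef(Zx[:, 0], Zy[:, 0])[0, 1])
```

Because each omics layer has its own loading matrix (`Wx`, `Wy`) mapping into the same low-dimensional space, a sample measured on only one platform can still be projected and assigned to a cluster, which is one plausible reading of how a framework of this kind can classify both single-omics and fused data.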