Shi Qianqian, Hu Bing, Zeng Tao, Zhang Chuanchao
Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China.
Department of Applied Mathematics, College of Science, Zhejiang University of Technology, Hangzhou, China.
Front Genet. 2019 Aug 20;10:744. doi: 10.3389/fgene.2019.00744. eCollection 2019.
Integrating distinct biological data types can provide a comprehensive view of biological processes and complex diseases. Because the combinations of molecules responsible for different phenotypes form multiple embedded (expression) subspaces, conventional integration methods struggle to identify the intrinsic data structure. In this paper, we propose a novel framework, "Multi-view Subspace Clustering Analysis (MSCA)," which measures the local similarities of samples within the same subspace and obtains global consensus sample patterns (structures) across multiple data types, thereby comprehensively capturing the underlying heterogeneity of samples. On various synthetic datasets, MSCA effectively recognizes the predefined sample patterns and is robust to data noise. On a real biological dataset, the Cancer Cell Line Encyclopedia (CCLE), MSCA successfully identifies cell clusters with common aberrations across cancer types. In both our simulation and case studies, MSCA also markedly outperforms state-of-the-art methods such as iClusterPlus, SNF, and ANF.
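The general idea of combining per-view local similarities into one consensus sample structure can be illustrated with a small sketch. This is not the MSCA algorithm, whose formulation is not given in the abstract. It swaps in generic stand-ins: a Gaussian k-nearest-neighbour affinity per data type, a plain average as the consensus step, and normalized spectral clustering on the result. The function names `local_affinity` and `consensus_clusters`, the parameter `k`, and the median-based bandwidth are all assumptions made for illustration.

```python
import numpy as np

def local_affinity(X, k=5):
    """Symmetric k-nearest-neighbour Gaussian affinity for one view.

    X is a (samples x features) matrix. Illustrative stand-in only,
    not the similarity measure used by MSCA.
    """
    # Pairwise squared Euclidean distances between samples.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Assumed bandwidth: median squared distance between distinct samples.
    sigma2 = np.median(d2[d2 > 0])
    W = np.exp(-d2 / sigma2)
    np.fill_diagonal(W, 0.0)
    # Keep only each sample's k strongest neighbours (local similarity).
    idx = np.argsort(-W, axis=1)[:, :k]
    mask = np.zeros(W.shape, dtype=bool)
    mask[np.arange(W.shape[0])[:, None], idx] = True
    mask = mask | mask.T  # symmetrize the neighbour graph
    return W * mask

def consensus_clusters(views, n_clusters, k=5, n_iter=50):
    """Average per-view affinities, then run normalized spectral clustering."""
    # Assumed consensus step: simple mean of the per-view affinities.
    W = sum(local_affinity(X, k) for X in views) / len(views)
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; take the top n_clusters.
    _, vecs = np.linalg.eigh(S)
    U = vecs[:, -n_clusters:]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Deterministic farthest-point initialization for k-means.
    centers = [U[0]]
    for _ in range(1, n_clusters):
        d = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers)
    # Lloyd iterations on the spectral embedding.
    for _ in range(n_iter):
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dist, axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = U[labels == c].mean(axis=0)
    return labels
```

Calling `consensus_clusters([view_a, view_b], n_clusters=2)` on two matrices that share the same sample order returns one cluster label per sample. Averaging is the simplest possible consensus rule; methods such as SNF instead fuse the per-view networks iteratively.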