College of Mathematics and System Sciences, Xinjiang University, Urumqi, China.
School of Computer Science and Technology, Anhui University, Hefei, China.
BMC Bioinformatics. 2021 May 25;22(1):268. doi: 10.1186/s12859-021-04195-4.
In recent years, various sequencing techniques have been used to collect biomedical omics datasets, and it is usually possible to obtain multiple types of omics data from a single patient sample. Clustering of omics data plays an indispensable role in biological and medical research because it helps reveal structure shared across multiple data collections. Nevertheless, clustering omics data poses many challenges. The primary ones are the high dimensionality of the data and the small number of samples. It is therefore difficult to find a suitable integration method for the structural analysis of multiple datasets.
In this paper, we propose a multi-view clustering method based on the Stiefel manifold (MCSM). The MCSM method comprises three core steps. First, we establish a binary optimization model for the simultaneous clustering problem. Second, we solve the optimization problem with a linear search algorithm based on the Stiefel manifold. Finally, we integrate the clustering results obtained from three omics types using the k-nearest neighbor method. We applied this approach to four cancer datasets from TCGA. The results show that our method outperforms several state-of-the-art methods, all of which rely on the assumption that the underlying cluster structure is the same across omics types.
In particular, our approach outperforms the compared approaches when the underlying clusters are inconsistent across omics types. For patients with different subtypes, both consistent and differential clusters can be identified at the same time.
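To make the second step more concrete, the following is a minimal sketch of optimization on the Stiefel manifold (matrices X with X^T X = I) using a backtracking search along the projected gradient direction. It is not the MCSM algorithm: the objective (a relaxed, spectral-style clustering objective that maximizes trace(X^T A X) for a symmetric similarity matrix A), the QR retraction, the Armijo step rule, and the names `qr_retract` and `stiefel_line_search` are all assumptions made for illustration.

```python
import numpy as np


def qr_retract(Y):
    """Map a full-rank n x k matrix back onto the Stiefel manifold via thin QR."""
    Q, R = np.linalg.qr(Y)
    # Fix column signs so that the diagonal of R is non-negative (unique factor).
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs


def stiefel_line_search(A, k, iters=200, seed=0):
    """Minimize f(X) = -trace(X^T A X) subject to X^T X = I, for symmetric A.

    Illustrative objective only; the paper's binary model is not reproduced here.
    """
    rng = np.random.default_rng(seed)
    X = qr_retract(rng.standard_normal((A.shape[0], k)))

    def f(Z):
        return -np.trace(Z.T @ A @ Z)

    for _ in range(iters):
        G = -2.0 * A @ X                       # Euclidean gradient of f
        XtG = X.T @ G
        R = G - X @ (XtG + XtG.T) / 2.0        # projection onto the tangent space
        g2 = np.sum(R * R)
        if g2 < 1e-12:                         # (near-)stationary point reached
            break
        fx, t = f(X), 1.0
        X_new = qr_retract(X - t * R)
        # Backtracking: halve the step until the Armijo decrease condition holds.
        while f(X_new) > fx - 1e-4 * t * g2 and t > 1e-10:
            t *= 0.5
            X_new = qr_retract(X - t * R)
        if f(X_new) >= fx:                     # no decrease found; stop
            break
        X = X_new
    return X
```

In this relaxed setting, the rows of the returned matrix could then be grouped (for example with k-means) to obtain cluster labels; how MCSM recovers binary assignments and combines the three omics views is described in the paper itself and is not reproduced here.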