IEEE/ACM Trans Comput Biol Bioinform. 2022 Jan-Feb;19(1):135-145. doi: 10.1109/TCBB.2021.3060340. Epub 2022 Feb 3.
The integration of several data sources to identify disease subtypes has gained attention over the past few years. The heterogeneity and high dimensionality of the data sets call for an adequate representation of the data. We summarize the field of representation learning for the multi-omics clustering problem and investigate several techniques for learning relevant combined representations, using methods from group factor analysis (PCA, MFA and extensions) and from machine learning with autoencoders. We highlight the importance of appropriately designing and training the latter, notably with a novel combination of a disjointed deep autoencoder (DDAE) architecture and a layer-wise reconstruction loss. These different representations can then be clustered to identify biologically meaningful clusters of patients. We provide a unifying framework for model comparison between statistical and deep learning approaches by introducing a new weighted internal clustering index that evaluates how well the clustering information is retained from each source, favoring contributions from all data sets. We apply our methodology to two case studies for which previous integrative clustering work exists, TCGA Breast Cancer and TARGET Neuroblastoma, and show how our method can yield good and well-balanced clusters across the different data sources.
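As a rough illustration of the group factor analysis family mentioned above, the following sketch builds an MFA-style combined representation. Each data block is centered and divided by its largest singular value so that no single source dominates, and a global PCA is then run on the concatenated blocks. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mfa_representation`, the synthetic blocks `expr` and `meth`, and all sizes are hypothetical, and features are only centered (not scaled to unit variance).

```python
import numpy as np

def mfa_representation(blocks, n_components=2):
    """MFA-style joint embedding: weight each block, then run a global PCA.

    Hypothetical helper for illustration; assumes every block has the same
    rows (patients) in the same order and is not constant.
    """
    weighted = []
    for X in blocks:
        Xc = X - X.mean(axis=0)                      # center each feature
        s1 = np.linalg.svd(Xc, compute_uv=False)[0]  # largest singular value of the block
        weighted.append(Xc / s1)                     # block now has leading singular value 1
    Z = np.hstack(weighted)                          # patients x (all features)
    U, S, _ = np.linalg.svd(Z, full_matrices=False)  # PCA of the weighted, centered matrix
    return U[:, :n_components] * S[:n_components]    # principal-component scores

# Synthetic stand-ins for two omics sources measured on the same 60 patients.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1, 2], 20)                    # three made-up patient groups
expr = rng.normal(size=(60, 200)) + groups[:, None] * 1.0
meth = rng.normal(size=(60, 50)) * 10 + groups[:, None] * 10.0

scores = mfa_representation([expr, meth], n_components=2)
print(scores.shape)
```

The resulting `scores` matrix can be passed to any standard clustering algorithm. Dividing by the first singular value is the usual MFA weighting; an autoencoder-based representation would replace the global PCA step with a learned encoder.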