Rather Arif Ahmad, Chachoo Manzoor Ahmad
Department of Computer Sciences, University of Kashmir, Srinagar, JK, India.
Comput Biol Med. 2023 Mar;155:106640. doi: 10.1016/j.compbiomed.2023.106640. Epub 2023 Feb 8.
Deciphering the information hidden in gene expression assays to identify disease subtypes is of significant importance in precision medicine. However, the intricacy of biological networks and the curse of dimensionality in gene expression data impose computational limitations on this process. Clustering is therefore often the first choice for exploratory data analysis, used to identify natural structures and intrinsic patterns in the data. Yet the sparse, high-dimensional nature of omics data prevents conventional clustering algorithms from discovering subtypes that are clinically relevant and statistically significant. Coupling non-linear dimensionality reduction with clustering thus often becomes imperative for improving clustering results. In this study, we present a robust pipeline for discovering clinically relevant disease subtypes. Specifically, we focus on discovering patient sub-groups whose residual life patterns differ markedly from those of other sub-groups. This is significant because, by refining prognosis, subtyping can reduce uncertainty in estimating a patient's expected outcome. The methodology presented is based on robust correlation estimation, UMAP (a non-linear dimensionality reduction method), and Mapper (a tool from topology). Notably, we suggest a method for improving the robustness of the correlation matrix of gene expression data in order to improve the clustering results. The performance of the model is evaluated on five cancer datasets obtained through TCGA, and comparisons are made with the state-of-the-art methods NEMO, RSC-OTRI, and SNF with regard to the log-rank test and Restricted Life Expectancy Difference. For example, in the GBM dataset, the minimum separation between any two discovered subtypes is 221 days, which is significantly higher than that obtained with the other methodologies.
We also compared results obtained without the robust correlation estimate and observed that robust correlation significantly improves the separability of the survival curves. From the results, we infer that our methodology separates the survival curves of patient sub-groups better than the other methodologies, despite using a single omics profile per patient, whereas SNF and NEMO use multiple omics profiles. Pathway over-representation analysis is performed on the final clustering results to investigate the biological underpinnings that characterize each subtype.
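The abstract reports separation between subtypes in days but does not define how the Restricted Life Expectancy Difference is computed. As a rough illustration only, the sketch below assumes it is close to the difference in restricted mean survival time, that is, the area under each sub-group's Kaplan-Meier curve up to a horizon `tau`. The function names, the horizon, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return Kaplan-Meier steps as (event_time, survival_just_after_that_time).

    events[i] is 1 if patient i died at times[i], 0 if the patient was censored.
    """
    steps = []
    survival = 1.0
    # Only times at which at least one death occurred change the curve.
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - deaths / at_risk
        steps.append((t, survival))
    return steps


def restricted_mean_survival(times: Sequence[float], events: Sequence[int], tau: float) -> float:
    """Area under the Kaplan-Meier step curve between 0 and tau."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    # Last segment runs from the final step before tau up to tau itself.
    return area + prev_s * (tau - prev_t)


def restricted_difference(group_a, group_b, tau: float) -> float:
    """Restricted mean survival of group_b minus that of group_a.

    Each group is a (times, events) pair.
    """
    return restricted_mean_survival(*group_b, tau) - restricted_mean_survival(*group_a, tau)


# Toy data (hypothetical): survival times in days, every event observed.
subtype_a = ([100, 200, 300, 400], [1, 1, 1, 1])
subtype_b = ([500, 600, 700, 800], [1, 1, 1, 1])
separation_days = restricted_difference(subtype_a, subtype_b, tau=400)
```

With this toy data, the two sub-groups are separated by about 150 days over a 400-day horizon. In a subtyping study, the same quantity would be computed for every pair of discovered clusters, and the minimum over pairs would give a figure comparable in spirit to the 221-day separation quoted for GBM.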