Battistella Enzo, Vakalopoulou Maria, Sun Roger, Estienne Theo, Lerousseau Marvin, Nikolaev Sergey, Andres Emilie Alvarez, Carre Alexandre, Niyoteka Stephane, Robert Charlotte, Paragios Nikos, Deutsch Eric
IEEE/ACM Trans Comput Biol Bioinform. 2022 Nov-Dec;19(6):3317-3331. doi: 10.1109/TCBB.2021.3123910. Epub 2022 Dec 8.
Precision medicine is a paradigm shift in healthcare that relies heavily on genomics data. However, the complexity of biological interactions, the large number of genes, and the lack of comparative analyses of the data remain a major bottleneck for clinical adoption. In this paper, we introduce a novel, automatic, and unsupervised framework for discovering low-dimensional gene biomarkers. Our method is based on the LP-Stability algorithm, a center-based unsupervised clustering algorithm for high-dimensional data. It is modular with respect to the metric function, scales well, and automatically determines the best number of clusters. Our evaluation combines mathematical and biological criteria into a quantitative metric. The recovered signature is applied to a variety of biological tasks, including the screening of biological pathways and functions and the assessment of its relevance for characterizing tumor types and subtypes. Quantitative comparisons among different distance metrics, commonly used clustering methods, and a reference gene signature from the literature confirm the state-of-the-art performance of our approach. In particular, our signature, based on 27 genes, achieves at least 30 times better mathematical significance (average Dunn's Index) and 25% better biological significance (average Enrichment in Protein-Protein Interaction) than the signatures produced by the other reference clustering methods. Finally, our signature shows promising results in distinguishing immune-inflammatory from immune-desert tumors, while reaching a high balanced accuracy of 92% on tumor type classification and an average balanced accuracy of 68% on tumor subtype classification, which are respectively 7% and 9% higher than the results of the reference signature.
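The two headline metrics in the abstract, Dunn's Index and balanced accuracy, can be illustrated with a minimal sketch. It assumes Euclidean distance and one common variant of the Dunn index (smallest distance between points in different clusters divided by the largest within-cluster diameter). The abstract does not state which variant or distance the authors used, so the function names and toy data below are purely illustrative and are not the paper's implementation.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest cluster diameter.

    Assumed variant: single-linkage separation over complete diameter.
    Higher values indicate compact, well-separated clusters.
    """
    min_between = float("inf")
    max_within = 0.0
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        d = dist(p, q)
        if lp == lq:
            max_within = max(max_within, d)    # candidate cluster diameter
        else:
            min_between = min(min_between, d)  # candidate cluster separation
    if min_between == float("inf"):
        raise ValueError("need at least two clusters")
    if max_within == 0:
        return float("inf")  # every cluster is a single point
    return min_between / max_within


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for c in set(y_true):
        preds = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(p == c for p in preds) / len(preds))
    return sum(recalls) / len(recalls)


# Toy data (hypothetical): two tight clusters placed far apart.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
toy_labels = [0, 0, 1, 1]
print(dunn_index(toy_points, toy_labels))  # 10.0

# Toy labels (hypothetical): class 0 recall is 2/3, class 1 recall is 1.
print(round(balanced_accuracy([0, 0, 0, 1], [0, 0, 1, 1]), 3))  # 0.833
```

Balanced accuracy averages recall over classes, which is why it is a natural choice when tumor types or subtypes are unevenly represented in a cohort.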