BIO3 - Systems Genetics, GIGA-R Medical Genomics, University of Liege, Liege, Belgium.
BIO3 - Systems Medicine, Department of Human Genetics, KU Leuven, Leuven, Belgium.
Brief Bioinform. 2023 Mar 19;24(2). doi: 10.1093/bib/bbad029.
Many problems in the life sciences can be reduced to a comparison of graphs. Although many such techniques exist, they often assume prior knowledge of the partitioning or the number of clusters, and they fail to assess the statistical significance of observed between-network heterogeneity. To address these issues, we developed an unsupervised workflow that identifies groups of graphs from reliable network-based statistics. In particular, we first compute the similarity between networks via appropriate distance measures between graphs and use these distances in an unsupervised hierarchical algorithm to identify classes of similar networks. Then, to determine the optimal number of clusters, we recursively test for differences in distances between two groups of networks. The test itself is inspired by distance-wise ANOVA algorithms. Finally, we assess significance by permuting between-object distance matrices. Notably, the approach, which we call netANOVA, is flexible: users can choose among multiple options to adapt it to specific contexts and network types. We demonstrate the benefits and pitfalls of our approach via extensive simulations and an application to two real-life datasets. NetANOVA achieved high performance in many simulation scenarios while controlling type I error. On non-synthetic data, comparison against state-of-the-art methods showed that netANOVA is often among the top performers. There are many application fields, including precision medicine, in which identifying disease subtypes via individual-level biological networks improves prevention programs, diagnosis and disease monitoring.
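The workflow described above (pairwise graph distances, hierarchical clustering, recursive two-group testing with permutation-based significance) can be illustrated with a minimal Python sketch. This is not the authors' netANOVA implementation. The Frobenius distance between adjacency matrices, average linkage, the PERMANOVA-style pseudo-F statistic, re-clustering at each recursion level, and all function names and parameters (`frobenius_distances`, `pseudo_f`, `permutation_pvalue`, `recursive_split`, `alpha`, `min_size`, `n_perm`) are assumptions chosen for illustration. The sketch also makes no attempt to reproduce the type I error control reported for netANOVA.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def frobenius_distances(adjs):
    """Pairwise Frobenius distance between adjacency matrices (assumed distance)."""
    n = len(adjs)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = np.linalg.norm(adjs[i] - adjs[j])
    return d


def pseudo_f(d, labels):
    """Distance-based pseudo-F: between-group over within-group sums of squares."""
    n = len(labels)
    groups = np.unique(labels)
    ss_total = (d ** 2).sum() / (2 * n)
    ss_within = 0.0
    for g in groups:
        idx = np.where(labels == g)[0]
        ss_within += (d[np.ix_(idx, idx)] ** 2).sum() / (2 * len(idx))
    k = len(groups)
    return ((ss_total - ss_within) / (k - 1)) / (ss_within / (n - k))


def permutation_pvalue(d, labels, n_perm=999, rng=None):
    """P-value from shuffling group labels, which is equivalent to permuting
    the rows and columns of the distance matrix."""
    rng = np.random.default_rng(rng)
    observed = pseudo_f(d, labels)
    hits = sum(pseudo_f(d, rng.permutation(labels)) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)


def recursive_split(d, members, alpha=0.05, min_size=3, n_perm=999, rng=None):
    """Split `members` in two by hierarchical clustering and keep the split only
    if the permutation test is significant; then recurse into each half."""
    rng = np.random.default_rng(rng)
    members = np.asarray(members)
    if len(members) < 2 * min_size:
        return [members]
    sub = d[np.ix_(members, members)]
    tree = linkage(squareform(sub, checks=False), method="average")
    labels = fcluster(tree, t=2, criterion="maxclust")
    sizes = np.bincount(labels)[1:]
    if len(sizes) < 2 or sizes.min() < min_size:
        return [members]
    if permutation_pvalue(sub, labels, n_perm, rng) > alpha:
        return [members]
    clusters = []
    for g in (1, 2):
        clusters += recursive_split(d, members[labels == g],
                                    alpha, min_size, n_perm, rng)
    return clusters
```

Given a list `nets` of equally sized adjacency matrices, calling `recursive_split(frobenius_distances(nets), np.arange(len(nets)))` returns one index array per accepted cluster, so the number of clusters follows from the tests rather than being fixed in advance. Any other graph distance can be substituted by supplying a different symmetric distance matrix.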