Osteoarthritis Research Program, Division of Orthopedic Surgery, Schroeder Arthritis Institute, and Data Science Discovery Centre for Chronic Diseases, Krembil Research Institute, University Health Network, 60 Leonard Avenue, Toronto, ON M5T 0S8, Canada.
Department of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, United States.
Bioinformatics. 2023 Aug 1;39(8). doi: 10.1093/bioinformatics/btad477.
Many real-world problems can be modeled as annotated graphs. Scalable graph algorithms that extract actionable information from such data are in demand since these graphs are large, varying in topology, and have diverse node/edge annotations. When these graphs change over time they create dynamic graphs, and open the possibility to find patterns across different time points. In this article, we introduce a scalable algorithm that finds unique dense regions across time points in dynamic graphs. Such algorithms have applications in many different areas, including the biological, financial, and social domains.
There are three important contributions to this manuscript. First, we designed a scalable algorithm, USNAP, to effectively identify dense subgraphs that are unique to a time stamp given a dynamic graph. Importantly, USNAP provides a lower bound of the density measure in each step of the greedy algorithm. Second, insights and understanding obtained from validating USNAP on real data show its effectiveness. While USNAP is domain independent, we applied it to four non-small cell lung cancer gene expression datasets. Stages in non-small cell lung cancer were modeled as dynamic graphs, and input to USNAP. Pathway enrichment analyses and comprehensive interpretations from literature show that USNAP identified biologically relevant mechanisms for different stages of cancer progression. Third, USNAP is scalable, and has a time complexity of O(m+mc log nc+nc log nc), where m is the number of edges, and n is the number of vertices in the dynamic graph; mc is the number of edges, and nc is the number of vertices in the collapsed graph.
The code of USNAP is available at https://www.cs.utoronto.ca/~juris/data/USNAP22.
许多现实世界的问题都可以建模为有注释的图。从这些数据中提取可操作信息的可扩展图算法需求量很大,因为这些图很大,拓扑结构各异,节点/边注释也多种多样。当这些图随时间变化时,它们会创建动态图,并有可能在不同时间点找到模式。在本文中,我们引入了一种可扩展的算法,该算法可以在动态图中找到随时间变化的独特密集区域。此类算法在许多不同领域都有应用,包括生物、金融和社交领域。
本文有三个重要贡献。首先,我们设计了一种可扩展的算法 USNAP,用于有效地识别给定动态图中特定时间戳的唯一密集子图。重要的是,USNAP 在贪婪算法的每一步都提供了密度度量的下界。其次,通过在真实数据上验证 USNAP 获得的见解和理解表明了其有效性。虽然 USNAP 与领域无关,但我们将其应用于四个非小细胞肺癌基因表达数据集。非小细胞肺癌的各个阶段被建模为动态图,并输入到 USNAP 中。通路富集分析和来自文献的综合解释表明,USNAP 为癌症进展的不同阶段确定了生物学上相关的机制。第三,USNAP 是可扩展的,其时间复杂度为 O(m+mclognc+nc log nc),其中 m 是边的数量,n 是动态图中的顶点数;mc 是边的数量,nc 是折叠图中的顶点数。
USNAP 的代码可在 https://www.cs.utoronto.ca/~juris/data/USNAP22 上获得。