Department of Bioinformatics and Data Science, Cell Signaling Technology Inc., 3 Trask Lane, Danvers, MA 01923, USA.
Database (Oxford). 2020 Dec 11;2020. doi: 10.1093/database/baaa110.
Graph representations provide an elegant solution to capture and analyze complex molecular mechanisms in the cell. Co-expression networks are undirected graph representations of transcriptional co-behavior indicating (co-)regulations, functional modules or even physical interactions between the corresponding gene products. The growing avalanche of available RNA sequencing (RNAseq) data fuels the construction of such networks, which are usually stored in relational databases like most other biological data. Inferring linkage by recursive multiple-join statements, however, is computationally expensive and complex to design in relational databases. In contrast, graph databases store and represent complex interconnected data as nodes, edges and properties, making it fast and intuitive to query and analyze relationships. While graph-based database technologies are on their way from a fringe domain to going mainstream, there are only a few studies reporting their application to biological data. We used the graph database management system Neo4j to store and analyze co-expression networks derived from RNAseq data from The Cancer Genome Atlas. Comparing co-expression in tumors versus healthy tissues in six cancer types revealed significant perturbation tracing back to erroneous or rewired gene regulation. Applying centrality, community detection and pathfinding graph algorithms uncovered the destruction or creation of central nodes, modules and relationships in co-expression networks of tumors. Given the speed, accuracy and straightforwardness of managing these densely connected networks, we conclude that graph databases are ready for entering the arena of biological data.
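The comparison described in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' pipeline and does not use Neo4j. It assumes a co-expression edge is drawn whenever the absolute Pearson correlation between two genes reaches a cutoff. The cutoff of 0.8, the gene names, the expression values and all function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.8):
    """Map each undirected gene pair with |r| >= threshold to its r.

    `threshold` is an assumed cutoff, not a value taken from the paper.
    """
    edges = {}
    for gene_a, gene_b in combinations(sorted(expression), 2):
        r = pearson(expression[gene_a], expression[gene_b])
        if abs(r) >= threshold:
            # frozenset makes the edge undirected: {A, B} == {B, A}
            edges[frozenset((gene_a, gene_b))] = r
    return edges


def degree(edges):
    """Count edges per gene (unnormalized degree, a simple centrality)."""
    counts = {}
    for pair in edges:
        for gene in pair:
            counts[gene] = counts.get(gene, 0) + 1
    return counts


def rewiring(normal_edges, tumor_edges):
    """Return (lost, gained): edges only in normal, edges only in tumor."""
    normal_pairs, tumor_pairs = set(normal_edges), set(tumor_edges)
    return normal_pairs - tumor_pairs, tumor_pairs - normal_pairs


# Made-up expression values: five samples per gene, per condition.
normal = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.0, 4.1, 6.0, 8.2, 10.0],
    "GENE_C": [2.0, 5.0, 1.0, 4.0, 3.0],
}
tumor = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [3.0, 1.0, 4.0, 1.0, 5.0],
    "GENE_C": [5.0, 4.1, 3.0, 2.0, 0.9],
}

normal_edges = coexpression_edges(normal)
tumor_edges = coexpression_edges(tumor)
lost, gained = rewiring(normal_edges, tumor_edges)

print("lost in tumor:", [sorted(pair) for pair in lost])
print("gained in tumor:", [sorted(pair) for pair in gained])
print("tumor degree:", degree(tumor_edges))
```

In this toy data the GENE_A and GENE_B relationship is present only in the normal network and a GENE_A and GENE_C relationship appears only in the tumor network, which is the kind of destroyed or created relationship the abstract refers to. In a graph database the same gene pairs would be stored as nodes and relationships with the correlation as a property, so neighborhood and path queries would not need recursive joins.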