Departamento de Informática y Sistemas, Universidad de Murcia, CEIR Campus Mare Nostrum, Instituto Murciano de Investigación Biosanitaria (IMIB), 30100 Murcia, Spain.
Department of Biology, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway.
Nucleic Acids Res. 2024 Aug 27;52(15):e69. doi: 10.1093/nar/gkae566.
Knowledge about transcription factor binding and regulation, target genes, cis-regulatory modules and topologically associating domains (TADs) is defined not only by functional associations, such as biological processes or diseases, but also by location on the genome. Here, we exploit the location and functional aspects together to develop new strategies for advanced data querying. Many databases have been developed to provide information about enhancers, but a schema that allows the standardized representation of data, securing interoperability between resources, has been lacking. In this work, we use knowledge graphs for the standardized representation of enhancers and TADs, together with data about their target genes, transcription factors, location on the human genome, and functional data about diseases and gene ontology (GO) annotations. We used this schema to integrate twenty-five enhancer datasets and two domain datasets, creating the most powerful integrative resource in this field to date. The knowledge graphs have been implemented using the Resource Description Framework and integrated within the open-access BioGateway knowledge network, generating a resource that contains an interoperable set of knowledge graphs (enhancers, TADs, genes, proteins, diseases, GO terms, and interactions between domains). We show how advanced queries, which combine functional and location restrictions, can be used to develop new hypotheses about functional aspects of gene expression regulation.
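To illustrate what a query combining functional and location restrictions might look like, the following is a minimal sketch in plain Python over a toy triple list. It is not the BioGateway schema or its query interface: the identifiers (`enh:E1`, `gene:G1`), predicate names (`chromosome`, `start`, `end`, `targets`, `annotated_with`), coordinates, and the helper functions are all hypothetical, and the GO identifiers are used only as placeholder labels.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: object


# Hypothetical toy graph: every identifier, predicate and coordinate below
# is invented for illustration and does not come from the actual resource.
GRAPH = [
    Triple("enh:E1", "chromosome", "chr1"),
    Triple("enh:E1", "start", 1000),
    Triple("enh:E1", "end", 1500),
    Triple("enh:E1", "targets", "gene:G1"),
    Triple("enh:E2", "chromosome", "chr1"),
    Triple("enh:E2", "start", 9000),
    Triple("enh:E2", "end", 9400),
    Triple("enh:E2", "targets", "gene:G2"),
    Triple("enh:E3", "chromosome", "chr2"),
    Triple("enh:E3", "start", 1100),
    Triple("enh:E3", "end", 1300),
    Triple("enh:E3", "targets", "gene:G1"),
    Triple("enh:E4", "chromosome", "chr1"),
    Triple("enh:E4", "start", 1200),
    Triple("enh:E4", "end", 1800),
    Triple("enh:E4", "targets", "gene:G3"),
    Triple("gene:G1", "annotated_with", "GO:0006355"),
    Triple("gene:G2", "annotated_with", "GO:0006355"),
    Triple("gene:G3", "annotated_with", "GO:0008150"),
]


def objects(graph, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return [t.obj for t in graph if t.subject == subject and t.predicate == predicate]


def enhancers_in_region_with_function(graph, chrom, region_start, region_end, go_term):
    """Find (enhancer, gene) pairs where the enhancer overlaps the region
    and its target gene carries the given annotation."""
    hits = []
    # Candidate enhancers: every subject located on the requested chromosome.
    candidates = sorted(
        {t.subject for t in graph if t.predicate == "chromosome" and t.obj == chrom}
    )
    for enh in candidates:
        start = objects(graph, enh, "start")[0]
        end = objects(graph, enh, "end")[0]
        # Location restriction: skip enhancers that do not overlap the region.
        if end < region_start or start > region_end:
            continue
        # Functional restriction: keep targets annotated with the requested term.
        for gene in objects(graph, enh, "targets"):
            if go_term in objects(graph, gene, "annotated_with"):
                hits.append((enh, gene))
    return hits


result = enhancers_in_region_with_function(GRAPH, "chr1", 500, 2000, "GO:0006355")
print(result)
```

In this toy graph, `enh:E2` fails the location restriction, `enh:E3` lies on a different chromosome, and `enh:E4` fails the functional restriction, so only `enh:E1` with `gene:G1` is returned. Against an RDF store, the same two restrictions would more naturally be expressed as a single SPARQL graph pattern with a numeric filter on the coordinates.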