Martins Yasmmin Côrtes, Ziviani Artur, Cerqueira E Costa Maiana de Oliveira, Cavalcanti Maria Cláudia Reis, Nicolás Marisa Fabiana, de Vasconcelos Ana Tereza Ribeiro
Bioinformatics Laboratory, National Laboratory for Scientific Computing, Petrópolis 25651-076, Brazil.
Data Extreme Laboratory (DEXL), National Laboratory for Scientific Computing, Petrópolis 25651-076, Brazil.
Bioinform Adv. 2023 Jun 1;3(1):vbad067. doi: 10.1093/bioadv/vbad067. eCollection 2023.
Semantic web standards have proved important over the last 20 years in promoting data formalization and interlinking between existing knowledge graphs. In this context, several ontologies and data integration initiatives have emerged in recent years in the biological domain, such as the widely used Gene Ontology, which contains metadata to annotate gene function and subcellular location. Another important subject in the biological domain is protein-protein interactions (PPIs), which have applications such as protein function inference. Current PPI databases use heterogeneous export methods, which complicates their integration and analysis. Several ontology initiatives covering some concepts of the PPI domain are currently available to promote interoperability across datasets. However, efforts to establish guidelines for automatic semantic data integration and analysis of PPIs in these datasets are limited. Here, we present PPIntegrator, a system that semantically describes data related to protein interactions. We also introduce an enrichment pipeline to generate, predict and validate new potential host-pathogen datasets by transitivity analysis. PPIntegrator contains a data preparation module to organize data from three reference databases and a triplification and data fusion module to describe the provenance information and results. This work provides an overview of the PPIntegrator system applied to integrate and compare host-pathogen PPI datasets from four bacterial species using our proposed transitivity analysis pipeline. We also demonstrate some critical queries for analyzing this kind of data and highlight the importance and usage of the semantic data generated by our system.
Availability and implementation: https://github.com/YasCoMa/ppintegrator, https://github.com/YasCoMa/ppi_validation_process and https://github.com/YasCoMa/predprin.
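To make the idea of "triplification" in the abstract concrete, the following is a minimal sketch of how a single PPI record with provenance might be serialized as N-Triples statements. It is not the PPIntegrator implementation: the `http://example.org/ppi/` namespace, the predicate names (`participant`, `sourceDatabase`, `score`), the `Interaction` record type and the accession values are all hypothetical and chosen only for illustration. Only the UniProt IRI prefix and the `rdf:type` IRI are existing identifiers.

```python
from typing import List, NamedTuple

# Hypothetical namespace and predicate names, used only for illustration;
# they are NOT the vocabulary used by PPIntegrator.
EX = "http://example.org/ppi/"
# Existing identifiers: UniProt entry IRI prefix and the rdf:type IRI.
UNIPROT = "http://purl.uniprot.org/uniprot/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Interaction(NamedTuple):
    """One PPI record as it might look after a data preparation step (assumed shape)."""
    protein_a: str  # accession of the first interactor
    protein_b: str  # accession of the second interactor
    source_db: str  # provenance: name of the database the record came from
    score: float    # confidence score reported by the source


def iri(value: str) -> str:
    """Wrap a string as an N-Triples IRI term."""
    return f"<{value}>"


def literal(value: object) -> str:
    """Wrap a value as a plain N-Triples string literal, escaping backslashes and quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triplify(record: Interaction, index: int) -> List[str]:
    """Describe one interaction, including its provenance, as N-Triples lines."""
    subject = iri(f"{EX}interaction/{index}")
    triples = [
        (subject, iri(RDF_TYPE), iri(EX + "Interaction")),
        (subject, iri(EX + "participant"), iri(UNIPROT + record.protein_a)),
        (subject, iri(EX + "participant"), iri(UNIPROT + record.protein_b)),
        (subject, iri(EX + "sourceDatabase"), literal(record.source_db)),
        (subject, iri(EX + "score"), literal(record.score)),
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


if __name__ == "__main__":
    # Placeholder accessions and database name, not real interaction data.
    example = Interaction("P00001", "Q00002", "ExampleDB", 0.9)
    for line in triplify(example, 1):
        print(line)
```

Modelling each interaction as its own resource, rather than as a direct protein-to-protein edge, leaves room to attach provenance such as the source database and score to the interaction itself, which is the kind of information the abstract says the triplification and data fusion module describes.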