Livingston Kevin M, Bada Michael, Baumgartner William A, Hunter Lawrence E
Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
BMC Bioinformatics. 2015 Apr 23;16(1):126. doi: 10.1186/s12859-015-0559-3.
The ability to query many independent biological databases using a common ontology-based semantic model would facilitate deeper integration and more effective utilization of these diverse and rapidly growing resources. Despite ongoing work toward shared data formats and linked identifiers, significant problems persist in semantic data integration, particularly in establishing shared identity and shared meaning across heterogeneous biomedical data sources.
We present five processes for semantic data integration that, when applied collectively, solve seven key problems. These processes include making explicit the differences between biomedical concepts and database records, aggregating sets of identifiers denoting the same biomedical concepts across data sources, and using declaratively represented forward-chaining rules to take information that is variably represented in source databases and integrate it into a consistent biomedical representation. We demonstrate these processes and solutions by presenting KaBOB (the Knowledge Base Of Biomedicine), a knowledge base of semantically integrated data from 18 prominent biomedical databases using common representations grounded in Open Biomedical Ontologies. An instance of KaBOB with data about humans and seven major model organisms can be built using on the order of 500 million RDF triples. All source code for building KaBOB is available under an open-source license.
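As a rough illustration of two of the processes named above (identifier aggregation and forward-chaining rules), the following minimal Python sketch groups identifiers that denote the same concept and then applies a single rule until no new facts appear. It is not KaBOB's implementation: the identifiers (`srcA:g1` and so on), the predicate names `record_interacts_with` and `interacts_with`, and the functions `aggregate_identifiers`, `forward_chain`, and `lift_interactions` are all hypothetical, and triples are modeled as plain Python tuples rather than RDF.

```python
from collections import defaultdict


def aggregate_identifiers(equivalences):
    """Group identifiers linked by pairwise 'same concept' assertions (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in equivalences:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for identifier in list(parent):
        groups[find(identifier)].add(identifier)
    return list(groups.values())


def forward_chain(facts, rules):
    """Apply every rule repeatedly until no new facts are derived (a fixpoint)."""
    facts = set(facts)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(facts)
        if derived <= facts:
            return facts
        facts |= derived


# Hypothetical cross-references between three made-up sources.
equivalences = [
    ("srcA:g1", "srcB:g1"),
    ("srcB:g1", "srcC:g1"),
    ("srcA:g2", "srcB:g2"),
]
id_sets = aggregate_identifiers(equivalences)

# One concept per identifier set, kept distinct from the source records themselves.
concept_of = {i: "concept:" + min(s) for s in id_sets for i in s}


def lift_interactions(facts):
    """Rule: a record-level interaction implies a concept-level interaction."""
    lifted = set()
    for subj, pred, obj in facts:
        if pred == "record_interacts_with" and subj in concept_of and obj in concept_of:
            lifted.add((concept_of[subj], "interacts_with", concept_of[obj]))
    return lifted


# The same interaction, reported by two sources under different identifiers.
source_facts = {
    ("srcB:g1", "record_interacts_with", "srcA:g2"),
    ("srcC:g1", "record_interacts_with", "srcB:g2"),
}
integrated = forward_chain(source_facts, [lift_interactions])
```

In this toy setting both source records collapse into one concept-level statement, which is the kind of consistent representation the rules are meant to produce. A real knowledge base would express the rules declaratively over RDF rather than as Python functions.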
KaBOB is an integrated knowledge base of biomedical data representationally based in prominent, actively maintained Open Biomedical Ontologies, thus enabling queries of the underlying data in terms of biomedical concepts (e.g., genes and gene products, interactions and processes) rather than features of source-specific data schemas or file formats. KaBOB resolves many of the issues that routinely plague biomedical researchers intending to work with data from multiple data sources and provides a platform for ongoing data integration and development and for formal reasoning over a wealth of integrated biomedical data.