IEEE Trans Nanobioscience. 2020 Jan;19(1):102-116. doi: 10.1109/TNB.2019.2930494. Epub 2019 Jul 22.
Exploring the characteristics of 3D protein structures by querying relational databases that store them can be challenging, because the data must conform to a particular database schema. However, relational storage also brings several advantages: extensive searches can be expressed in declarative SQL, data can be protected against hardware failures through regular backups, and data can be secured against unauthorized access. Because relational databases do not provide exploration methods specific to protein data and its biological semantics, such as searches based on protein structural patterns, their use in this domain is still rare, and dedicated methods are needed to speed up data exploration. In this paper, we present a novel data partitioning scheme for distributing data across database clusters, which can be used to perform sophisticated explorations of 3D protein structures. The partitioning scheme relies on protein construction; it requires data preprocessing but results in shorter exploration times when querying federated databases. We address the problem of finding proteins in an Oracle relational database on the basis of 3D protein structure similarity, using distributed PAR-P3D-SQL queries. Since 3D protein structure similarity searching is one of the most time-consuming exploration processes that can be performed on protein data, we use a distributed environment of Oracle federated databases, distributed query processing, and dedicated load balancing methods to accelerate the exploration. Test results confirm that we can significantly increase the speed of the exploration process, proportionally to the number of database nodes in the federated environment.
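To make the idea of balancing work across database nodes concrete, the following is a minimal sketch of one generic heuristic. It is not the partitioning scheme of the paper, whose details the abstract does not give. It assumes, purely for illustration, that the cost of processing a protein is proportional to its residue count; the function name `partition_by_size` and the protein identifiers are hypothetical.

```python
import heapq


def partition_by_size(proteins, n_nodes):
    """Greedily assign proteins to nodes, largest first.

    proteins: dict mapping a protein id to its residue count
              (assumed here to be a proxy for search cost).
    n_nodes:  number of database nodes (hypothetical cluster size).
    Returns (assignment, loads), both keyed by node index.
    """
    # Min-heap of (current load, node index); the least loaded node is on top.
    heap = [(0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}

    # Largest proteins first; ties are broken by id for determinism.
    for pid, size in sorted(proteins.items(), key=lambda kv: (-kv[1], kv[0])):
        load, node = heapq.heappop(heap)
        assignment[node].append(pid)
        heapq.heappush(heap, (load + size, node))

    loads = {node: sum(proteins[p] for p in ids)
             for node, ids in assignment.items()}
    return assignment, loads


if __name__ == "__main__":
    # Made-up residue counts, for illustration only.
    example = {"protA": 300, "protB": 200, "protC": 150,
               "protD": 150, "protE": 100}
    assignment, loads = partition_by_size(example, n_nodes=2)
    print(assignment)
    print(loads)
```

If the per-node workloads come out roughly equal, as this heuristic aims for, each node can scan its share in parallel, which is the kind of setting in which a speedup proportional to the number of nodes becomes plausible.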