Silva Danilo, Moir Monika, Dunaiski Marcel, Blanco Natalia, Murtala-Ibrahim Fati, Baxter Cheryl, de Oliveira Tulio, Xavier Joicymara S
Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, 7602, South Africa.
Computer Science Division, Department of Mathematical Sciences, Faculty of Science, Stellenbosch University, Stellenbosch, 7602, South Africa.
Bioinform Adv. 2025 Jul 18;5(1):vbaf168. doi: 10.1093/bioadv/vbaf168. eCollection 2025.
In a world where data drive effective decision-making, bioinformatics and health science researchers often struggle to manage data efficiently. In these fields, data are typically diverse in both format and subject. Consequently, challenges in storing, tracking, and responsibly sharing valuable data have become increasingly evident over the past decades. To address these complexities, some approaches have relied on standard strategies, such as non-relational databases and data warehouses. However, these approaches often fall short of the flexibility and scalability required for complex projects. The data lake paradigm emerged to offer flexibility and handle large volumes of diverse data, but it lacks robust data governance and organization. The data lakehouse is a newer paradigm that combines the flexibility of a data lake with the governance of a data warehouse, offering a promising solution for managing heterogeneous data in bioinformatics. However, the lakehouse model remains largely unexplored in bioinformatics, with limited discussion in the current literature. In this study, we review strategies and tools for developing a data lakehouse infrastructure tailored to bioinformatics research. We summarize key concepts and assess available open-source and commercial solutions for managing data in bioinformatics.
Not applicable.