Gao Yifan, Mughal Zakariyya, Jaramillo-Villegas Jose A, Corradi Marie, Borrel Alexandre, Lieberman Ben, Sharif Suliman, Shaffer John, Fecho Karamarie, Chatrath Ajay, Maertens Alexandra, Teunis Marc A T, Kleinstreuer Nicole, Hartung Thomas, Luechtefeld Thomas
Center for Alternatives to Animal Testing, Johns Hopkins University, Baltimore, MD, United States.
Insilica, Bethesda, MD, United States.
Front Artif Intell. 2025 Aug 13;8:1599412. doi: 10.3389/frai.2025.1599412. eCollection 2025.
Researchers in biomedicine and public health often spend weeks locating, cleaning, and integrating data from disparate sources before analysis can begin. This repeated effort slows discovery and leads to inconsistent pipelines.
We created BioBricks.ai, an open, centralized repository that packages public biological and chemical datasets as modular "bricks." Each brick is a Data Version Control (DVC) Git repository containing an extract‑transform‑load (ETL) pipeline. A package‑manager‑like interface handles installation, dependency resolution, and updates, while data are delivered through a unified backend (https://biobricks.ai).
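To make the dependency-resolution step concrete, the following is a minimal sketch of how an install order could be derived when bricks depend on other bricks. It is not the BioBricks.ai implementation: the brick names, the `BRICK_DEPENDENCIES` mapping, and the `install_order` helper are all hypothetical, and the ordering uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical brick names and dependencies, for illustration only.
# Each key maps to the set of bricks it depends on.
BRICK_DEPENDENCIES = {
    "chem-bioactivity": {"chem-structures", "protein-targets"},
    "chem-structures": set(),
    "protein-targets": {"gene-annotations"},
    "gene-annotations": set(),
}


def install_order(target, dependencies):
    """Return the bricks needed for `target`, with dependencies listed first."""
    # Collect the target and everything it depends on, directly or indirectly.
    needed = {}
    stack = [target]
    while stack:
        brick = stack.pop()
        if brick in needed:
            continue
        needed[brick] = dependencies[brick]
        stack.extend(dependencies[brick])
    # static_order() yields each brick only after all of its dependencies.
    return list(TopologicalSorter(needed).static_order())


order = install_order("chem-bioactivity", BRICK_DEPENDENCIES)
print(order)
```

A topological sort guarantees that every brick appears after the bricks it depends on, and `graphlib` raises `CycleError` if the dependency graph contains a cycle.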
The current release provides more than 90 curated datasets spanning genomics, proteomics, cheminformatics, and epidemiology. Bricks can be combined programmatically to build composite resources; in benchmark use cases, assembling multi‑dataset analytic cohorts took minutes rather than the days required with bespoke scripts.
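As an illustration of combining bricks programmatically, the sketch below joins rows from two datasets on a shared identifier to form a composite table. The rows, the column names, and the `inner_join` helper are hypothetical stand-ins for tables loaded from installed bricks; the actual BioBricks.ai loading interface is not shown.

```python
# Hypothetical rows standing in for tables loaded from two installed bricks.
toxicity_rows = [
    {"compound_id": "C001", "assay": "cytotoxicity", "active": True},
    {"compound_id": "C002", "assay": "cytotoxicity", "active": False},
    {"compound_id": "C003", "assay": "cytotoxicity", "active": True},
]
property_rows = [
    {"compound_id": "C001", "mol_weight": 180.16},
    {"compound_id": "C003", "mol_weight": 46.07},
]


def inner_join(left, right, key):
    """Join two lists of dicts on a shared key, keeping only matching rows.

    Assumes `key` is unique within `right`; a later duplicate would
    overwrite an earlier one in the lookup index.
    """
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


composite = inner_join(toxicity_rows, property_rows, "compound_id")
for row in composite:
    print(row)
```

Only compounds present in both tables survive the join, so the composite contains just the rows that carry both an assay result and a molecular weight.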
BioBricks.ai accelerates data access, promotes reproducible workflows, and lowers the barrier for integrating heterogeneous public datasets. By treating data as version‑controlled software, the platform encourages community contributions and reduces redundant engineering effort. Continued expansion of brick coverage and automated provenance tracking will further enhance FAIR (Findable, Accessible, Interoperable, Reusable) data practices across the life‑science community.