Ni Eric, Knight Elizabeth, Gerstein Mark
Program in Computational Biology and Bioinformatics, Yale University, 100 College Street, New Haven, CT 06511, USA.
Yale School of Medicine, 333 Cedar Street, New Haven, CT 06510, USA.
J Biomed Inform. 2025 May;165:104818. doi: 10.1016/j.jbi.2025.104818. Epub 2025 Mar 29.
Blockchain technology is gaining traction in the biomedical sector because it can improve trust and reduce the risk of fraud and errors in health data management. However, blockchains scale poorly, and the large volume of biomedical datasets has slowed adoption. This challenge is especially relevant for applications that rely on blockchain's strong immutability by storing data directly on-chain. In this work, we demonstrate the potential of blockchain to create a secure and trustless environment for managing large on-chain records. Specifically, we detail an efficient, index-based approach for storing data on the Ethereum blockchain. We show that insertion and retrieval speeds remain nearly constant as the database grows, with running time scaling linearly with the amount of data processed. Additionally, we achieve substantial efficiency gains through low-level assembly optimizations on the Ethereum Virtual Machine, which also highlights limitations of the Solidity compiler. Finally, we illustrate this approach with a practical case study: designing and implementing a smart contract for storing and querying training certificates on the Ethereum blockchain. Compared to baseline methods, our solution achieves 2x faster data insertion, 500x faster retrieval, 60% lower gas costs, and 50% lower storage usage. It won first place in Track 1 of the 2022 iDASH secure genome analysis competition. We also demonstrate that this solution readily adapts to other data types, enabling efficient on-chain storage and retrieval of text, RNA-seq, or biomedical image data.
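The abstract does not spell out the index layout, so the following is only a minimal off-chain sketch in Python of the general idea behind index-based storage: records are appended once, and a lookup table keyed by (field, value) lets a query jump to matching records instead of scanning the whole database. The class name `IndexedRecordStore` and the field names `user` and `course` are hypothetical, and nothing here reproduces the authors' smart contract or its assembly optimizations.

```python
from collections import defaultdict


class IndexedRecordStore:
    """Toy in-memory stand-in for an index-based record store (hypothetical)."""

    def __init__(self):
        self.records = []               # append-only list of stored records
        self.index = defaultdict(list)  # (field, value) -> list of record ids

    def insert(self, record: dict) -> int:
        """Append a record and index each of its fields; returns the record id."""
        record_id = len(self.records)
        self.records.append(dict(record))
        for field, value in record.items():
            self.index[(field, value)].append(record_id)
        return record_id

    def query(self, **criteria) -> list:
        """Return the records matching every field=value pair, in insertion order."""
        if not criteria:
            return list(self.records)
        # Look up each criterion in the index, then intersect the id sets.
        id_sets = [set(self.index.get((f, v), ())) for f, v in criteria.items()]
        matching = set.intersection(*id_sets)
        return [self.records[i] for i in sorted(matching)]


# Example usage with made-up training-certificate fields.
store = IndexedRecordStore()
store.insert({"user": "alice", "course": "hipaa"})
store.insert({"user": "bob", "course": "hipaa"})
alice_certificates = store.query(user="alice")
```

In this sketch the cost of an insertion depends on the number of fields in the record and not on how many records are already stored, which is the property the abstract describes as speeds staying nearly constant relative to database size.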