Imburgia Carina, Organick Lee, Zhang Karen, Cardozo Nicolas, McBride Jeff, Bee Callista, Wilde Delaney, Roote Gwendolin, Jorgensen Sophia, Ward David, Anderson Charlie, Strauss Karin, Ceze Luis, Nivala Jeff
University of Washington, Paul G. Allen School of Computer Science and Engineering, Seattle, USA.
Microsoft Research, Redmond, USA.
Nat Commun. 2025 Jul 10;16(1):6388. doi: 10.1038/s41467-025-61264-5.
DNA is a promising medium for digital data storage due to its exceptional data density and longevity. Practical DNA-based storage systems require selective data retrieval to minimize decoding time and costs. In this work, we introduce CRISPR-Cas9 as a user-friendly tool for multiplexed, low-latency molecular data extraction. We first present a one-pot, multiplexed random access method in which specific data files are selectively cleaved using a CRISPR-Cas9 addressing system and then sequenced via nanopore technology. This approach was validated on a pool of 1.6 million DNA sequences, comprising 25 unique data files. We then developed a molecular similarity-search approach combining machine learning with Cas9-based retrieval. Using a deep neural network, we mapped a database of 1.74 million images into a reduced-dimensional embedding, encoding each embedding as a Cas9 target sequence. These target sequences act as molecular addresses, capturing clusters of semantically related images. By leveraging Cas9's off-target cleavage activity, query sequences cleave both exact and closely related targets, enabling high-fidelity retrieval of molecular addresses corresponding to in silico image clusters similar to the query. These approaches move towards addressing key challenges in molecular data retrieval by offering simplified, rapid isothermal protocols and new DNA data access capabilities.
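The similarity-search idea can be pictured with a toy in silico model. The sketch below is not the authors' method. It assumes that off-target cleavage can be approximated by a simple Hamming-distance threshold between a 20-nt query and each target address, which ignores real effects such as position-dependent mismatch tolerance and PAM requirements. The function names, the example sequences, the cluster labels, and the `max_mismatches` parameter are all hypothetical.

```python
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def retrieve(query: str, pool: Dict[str, str], max_mismatches: int = 2) -> Dict[str, str]:
    """Return every address within max_mismatches of the query.

    Assumption: a mismatch threshold stands in for Cas9 off-target
    cleavage; this is a crude proxy, not a biophysical model.
    """
    return {
        address: label
        for address, label in pool.items()
        if hamming(query, address) <= max_mismatches
    }


# Hypothetical pool: target address -> label of the image cluster it encodes.
pool = {
    "ACGTACGTACGTACGTACGT": "cluster_A",  # exact match to the query
    "ACGTACGTACGTACGTACGA": "cluster_B",  # one mismatch (last base)
    "TTTTGGGGCCCCAAAATTTT": "cluster_C",  # unrelated address
}

query = "ACGTACGTACGTACGTACGT"
similar = retrieve(query, pool, max_mismatches=2)
exact_only = retrieve(query, pool, max_mismatches=0)
```

With a threshold of zero the lookup behaves like exact-match random access. Raising the threshold also returns neighbouring addresses, which is the behaviour the abstract attributes to off-target cleavage.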