KoNA: Korean Nucleotide Archive as A New Data Repository for Nucleotide Sequence Data.

Affiliation

Korea Bioinformation Center, Korea Research Institute of Bioscience & Biotechnology, Daejeon 34141, Republic of Korea.

Publication information

Genomics Proteomics Bioinformatics. 2024 May 9;22(1). doi: 10.1093/gpbjnl/qzae017.

Abstract

During the last decade, the generation and accumulation of petabase-scale high-throughput sequencing data have resulted in great challenges, including access to human data, as well as transfer, storage, and sharing of enormous amounts of data. To promote data-driven biological research, the Korean government announced that all biological data generated from government-funded research projects should be deposited at the Korea BioData Station (K-BDS), which consists of multiple databases for individual data types. Here, we introduce the Korean Nucleotide Archive (KoNA), a repository of nucleotide sequence data. As of July 2022, the Korean Read Archive in KoNA has collected over 477 TB of raw next-generation sequencing data from national genome projects. To ensure data quality and prepare for international alignment, a standard operating procedure was adopted, which is similar to that of the International Nucleotide Sequence Database Collaboration. The standard operating procedure includes quality control processes for submitted data and metadata using an automated pipeline, followed by manual examination. To ensure fast and stable data transfer, a high-speed transmission system called GBox is used in KoNA. Furthermore, the data uploaded to or downloaded from KoNA through GBox can be readily processed using a cloud computing service called Bio-Express. This seamless coupling of KoNA, GBox, and Bio-Express enhances the data experience, including submission, access, and analysis of raw nucleotide sequences. KoNA not only satisfies the unmet needs for a national sequence repository in Korea but also provides datasets to researchers globally and contributes to advances in genomics. KoNA is available at https://www.kobic.re.kr/kona/.
