Cooler: scalable storage for Hi-C data and other genomically labeled arrays.

Authors

Nezar Abdennur, Leonid A. Mirny

Affiliations

Institute for Medical Engineering and Science, Cambridge, MA 02139, USA.

Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.

Publication information

Bioinformatics. 2020 Jan 1;36(1):311-316. doi: 10.1093/bioinformatics/btz540.

DOI:10.1093/bioinformatics/btz540

PMID:31290943

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8205516/

Abstract

MOTIVATION

Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis.
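To make the scaling concrete, here is a back-of-the-envelope sketch in Python. The genome length (a round 3 Gb, roughly human-sized) and the 4 bytes per stored value are assumptions chosen for illustration, not figures from the paper, and `dense_matrix_bytes` is a hypothetical helper.

```python
def dense_matrix_bytes(genome_length_bp, bin_size_bp, bytes_per_value=4):
    """Bytes needed to hold a full n x n matrix of fixed-width values."""
    # Integer ceiling division: a partial bin at the end still needs a row/column.
    n_bins = (genome_length_bp + bin_size_bp - 1) // bin_size_bp
    return n_bins * n_bins * bytes_per_value


GENOME_LENGTH_BP = 3_000_000_000  # assumed round figure, not taken from the paper

for bin_size in (1_000_000, 100_000, 10_000, 1_000):
    size_gb = dense_matrix_bytes(GENOME_LENGTH_BP, bin_size) / 1e9
    print(f"{bin_size:>9,} bp bins: {size_gb:,.3f} GB if stored densely")
```

Because the number of cells grows with the square of the number of bins, each tenfold increase in resolution multiplies the dense footprint by one hundred under these assumptions.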

RESULTS

We developed a file format called cooler, based on a sparse data model, that can support genomically labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium.
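The general idea of a sparse data model for a genomically labeled matrix can be sketched in a few lines of plain Python. This is an illustration only: it does not use the cooler library or reproduce its on-disk schema, and every name here (`Bin`, `make_bins`, `to_sparse_pixels`, `lookup`, the toy chromosomes) is hypothetical. It assumes a symmetric matrix, so only nonzero upper-triangle entries are kept.

```python
from collections import namedtuple

# One genomic interval labeling a row/column of the matrix.
Bin = namedtuple("Bin", ["chrom", "start", "end"])


def make_bins(chrom_sizes, bin_size):
    """Split each chromosome into fixed-width bins; the last bin may be shorter."""
    bins = []
    for chrom, length in chrom_sizes.items():
        for start in range(0, length, bin_size):
            bins.append(Bin(chrom, start, min(start + bin_size, length)))
    return bins


def to_sparse_pixels(dense):
    """Keep only nonzero upper-triangle cells, keyed by (row_bin, col_bin)."""
    n = len(dense)
    return {
        (i, j): dense[i][j]
        for i in range(n)
        for j in range(i, n)
        if dense[i][j] != 0
    }


def lookup(pixels, i, j):
    """Read a value back using symmetry; cells that were not stored are zero."""
    return pixels.get((min(i, j), max(i, j)), 0)


chrom_sizes = {"chrA": 250, "chrB": 120}  # hypothetical toy genome
bins = make_bins(chrom_sizes, bin_size=100)

dense = [  # symmetric toy matrix, one row/column per bin
    [5, 2, 0, 0, 0],
    [2, 7, 1, 0, 0],
    [0, 1, 4, 0, 3],
    [0, 0, 0, 6, 0],
    [0, 0, 3, 0, 2],
]
pixels = to_sparse_pixels(dense)
print(len(bins), "bins;", len(pixels), "stored pixels out of", len(dense) ** 2, "cells")
print("value at", bins[4], "x", bins[2], "=", lookup(pixels, 4, 2))
```

Keeping the axis labels (the bins) separate from the values (the pixels) is what lets the same layout describe a matrix at any bin size; a real implementation would add indexing and compression on top.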

AVAILABILITY AND IMPLEMENTATION

Cooler is cross-platform, BSD-licensed and can be installed from the Python Package Index (PyPI) or the Bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
