Cooler: scalable storage for Hi-C data and other genomically labeled arrays.

Author information

Nezar Abdennur, Leonid A. Mirny

Affiliations

Institute for Medical Engineering and Science, Cambridge, MA 02139, USA.

Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.

Publication information

Bioinformatics. 2020 Jan 1;36(1):311-316. doi: 10.1093/bioinformatics/btz540.

Abstract

MOTIVATION

Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis.
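The scaling behind this claim can be checked with a little arithmetic. The sketch below is illustrative only: the 3 Gb genome size and the 4-byte cell size are assumptions chosen for round numbers, not figures from the paper. It shows that the cell count of a dense square matrix grows with the square of the number of bins, so a tenfold finer bin size needs roughly a hundredfold more storage.

```python
# Illustrative arithmetic only; both constants are assumptions, not values from the paper.
GENOME_SIZE_BP = 3_000_000_000  # assumed round figure for a human-sized genome
BYTES_PER_CELL = 4              # assumed 32-bit integer counts


def dense_matrix_bytes(bin_size_bp: int) -> int:
    """Bytes needed to store a full n_bins x n_bins matrix at the given bin size."""
    n_bins = -(-GENOME_SIZE_BP // bin_size_bp)  # ceiling division
    return n_bins * n_bins * BYTES_PER_CELL


if __name__ == "__main__":
    for bin_size in (1_000_000, 100_000, 10_000, 1_000):
        gib = dense_matrix_bytes(bin_size) / 2**30
        print(f"{bin_size:>9} bp bins: {gib:,.2f} GiB if stored densely")
```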

RESULTS

We developed a file format called cooler, based on a sparse data model, that can support genomically labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium.
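The abstract does not spell out the on-disk layout, so the following is only a minimal sketch of what a sparse data model for a genomically labeled matrix could look like. It is not the cooler schema or the cooler API. All names here (`bins`, `pixels`, `fetch_rows`, `to_dense`) and the toy data are hypothetical. The idea is to keep one table describing the genomic bins along the axes and one sorted table holding only the nonzero matrix cells, so that a range of rows can be found by binary search instead of scanning a dense array.

```python
from bisect import bisect_left

# Hypothetical toy data: one chromosome cut into 10 bp bins (not a real cooler file).
BIN_SIZE = 10
bins = [("chr1", start, start + BIN_SIZE) for start in range(0, 50, BIN_SIZE)]

# Sparse records (bin1_id, bin2_id, count): only nonzero cells of the upper
# triangle of a symmetric matrix, kept sorted by (bin1_id, bin2_id).
pixels = [(0, 0, 5), (0, 1, 2), (1, 1, 7), (1, 3, 1), (2, 4, 3), (4, 4, 6)]
bin1_ids = [p[0] for p in pixels]  # sorted row ids, used as a search index


def fetch_rows(lo: int, hi: int):
    """Return the records whose bin1_id lies in [lo, hi), located by binary search."""
    return pixels[bisect_left(bin1_ids, lo):bisect_left(bin1_ids, hi)]


def to_dense(n_bins: int):
    """Expand the sparse records into a full symmetric matrix (small regions only)."""
    dense = [[0] * n_bins for _ in range(n_bins)]
    for i, j, count in pixels:
        dense[i][j] = count
        dense[j][i] = count
    return dense


if __name__ == "__main__":
    print(fetch_rows(1, 3))       # records for matrix rows 1 and 2
    print(to_dense(len(bins)))    # full 5 x 5 matrix rebuilt from 6 records
```

In this toy case six records stand in for 25 dense cells, and the saving grows as the matrix becomes larger and sparser.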

AVAILABILITY AND IMPLEMENTATION

Cooler is cross-platform, BSD-licensed and can be installed from the Python Package Index or the bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
