Abdennur Nezar, Mirny Leonid A
Institute for Medical Engineering and Science, Cambridge, MA 02139, USA.
Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
Bioinformatics. 2020 Jan 1;36(1):311-316. doi: 10.1093/bioinformatics/btz540.
Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis.
We developed a file format called cooler, based on a sparse data model, that can support genomically labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium.
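To make the sparse data model concrete, the following is a minimal sketch in plain Python. It does not use the cooler library or HDF5. It assumes a fixed-width bin table for the genomic axes and a list of nonzero upper-triangle "pixels" for the matrix values. The function names (`build_bins`, `build_pixels`), the toy chromosome sizes, and the contact list are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_bins(chromsizes, binsize):
    """Partition each chromosome into fixed-width bins.

    Returns a list of (chrom, start, end); a bin's ID is its list index.
    """
    bins = []
    for chrom, length in chromsizes.items():
        for start in range(0, length, binsize):
            bins.append((chrom, start, min(start + binsize, length)))
    return bins


def build_pixels(contacts, bins, binsize):
    """Aggregate pairwise contacts into sparse pixel records.

    Only nonzero cells of the upper triangle are kept, as sorted
    (bin1_id, bin2_id, count) tuples with bin1_id <= bin2_id.
    """
    # Look up a bin ID from (chromosome, bin offset within that chromosome).
    bin_id = {(chrom, start // binsize): i
              for i, (chrom, start, _end) in enumerate(bins)}
    counts = defaultdict(int)
    for chrom1, pos1, chrom2, pos2 in contacts:
        i = bin_id[(chrom1, pos1 // binsize)]
        j = bin_id[(chrom2, pos2 // binsize)]
        if i > j:  # the matrix is symmetric, so store one triangle only
            i, j = j, i
        counts[(i, j)] += 1
    return sorted((i, j, c) for (i, j), c in counts.items())


# Toy input (hypothetical): two small "chromosomes" and four contacts.
chromsizes = {"chr1": 1000, "chr2": 500}
binsize = 250
contacts = [
    ("chr1", 10, "chr1", 300),
    ("chr1", 320, "chr1", 20),
    ("chr1", 600, "chr2", 100),
    ("chr2", 400, "chr2", 450),
]

bins = build_bins(chromsizes, binsize)
pixels = build_pixels(contacts, bins, binsize)

dense_cells = len(bins) ** 2   # cells a dense matrix would allocate
sparse_records = len(pixels)   # records the sparse form actually stores
```

In this toy case a dense matrix would allocate 36 cells, whereas the sparse form stores 3 records. The gap widens as bin size shrinks, because the number of dense cells grows quadratically with the number of bins while the number of nonzero pixels is bounded by the number of observed contacts.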
Cooler is cross-platform, BSD-licensed, and can be installed from the Python Package Index (PyPI) or the Bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.
Supplementary data are available at Bioinformatics online.