Feng Jianglin, Sheffield Nathan C
Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22903, USA.
Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22903, USA.
Bioinformatics. 2021 Apr 9;37(1):118-120. doi: 10.1093/bioinformatics/btaa1062.
Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions.
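To make the idea of linear binning concrete, the following is a minimal sketch of a fixed-width bin index over genomic intervals. It is not IGD's actual implementation or on-disk format: the class name `LinearBinIndex`, the bin width, the in-memory dictionary layout, and the half-open `[start, end)` coordinate convention are all assumptions made for illustration. Each interval is stored in every bin it touches, and a query scans only the bins it touches, counting each overlapping interval once per dataset.

```python
from collections import defaultdict

BIN_SIZE = 16384  # hypothetical bin width in base pairs (an assumption, not IGD's documented value)


class LinearBinIndex:
    """Illustrative fixed-width bin index for half-open intervals [start, end)."""

    def __init__(self, bin_size=BIN_SIZE):
        self.bin_size = bin_size
        # (chrom, bin number) -> list of (start, end, dataset_id)
        self.bins = defaultdict(list)

    def add(self, chrom, start, end, dataset_id):
        # Store the interval in every bin it touches.
        first = start // self.bin_size
        last = (end - 1) // self.bin_size
        for b in range(first, last + 1):
            self.bins[(chrom, b)].append((start, end, dataset_id))

    def count_overlaps(self, chrom, qstart, qend):
        """Return {dataset_id: number of intervals overlapping [qstart, qend)}."""
        counts = defaultdict(int)
        first = qstart // self.bin_size
        last = (qend - 1) // self.bin_size
        for b in range(first, last + 1):
            for start, end, dataset_id in self.bins.get((chrom, b), ()):
                if start < qend and end > qstart:
                    # An interval may sit in several bins; count it only in the
                    # bin that holds the first position shared with the query.
                    if max(start, qstart) // self.bin_size == b:
                        counts[dataset_id] += 1
        return dict(counts)


index = LinearBinIndex(bin_size=10)
index.add("chr1", 5, 25, "A")   # spans three bins
index.add("chr1", 12, 14, "A")
index.add("chr1", 30, 40, "B")
hits = index.count_overlaps("chr1", 8, 32)
```

Because a bin number is just an integer division of the coordinate, locating the candidate intervals for a query needs no tree traversal, which is the property that makes this kind of layout attractive at large scale.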
IGD is available at https://github.com/databio/IGD.
Supplementary data are available at Bioinformatics online.