Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland.
Bioinformatics. 2018 Jun 1;34(11):1834-1840. doi: 10.1093/bioinformatics/bty023.
Genome sequencing is now routine in many research centers. Projects such as the Haplotype Reference Consortium and the Exome Aggregation Consortium produce huge databases of genotypes for large populations. As these collections grow, fast and memory-frugal ways of representing and searching them become crucial.
We present GTC (GenoType Compressor), a novel compressed data structure for representing huge collections of genetic variation data. It significantly outperforms existing solutions in compression ratio and in the time needed to answer various types of queries. We show that the largest publicly available database, of about 60 000 haplotypes at about 40 million SNPs, can be stored in <4 GB, while queries related to variants are answered in a fraction of a second.
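To put the <4 GB figure in perspective, the following back-of-the-envelope sketch compares it with a naive bit-packed layout. It is illustrative only: it assumes biallelic sites stored at 1 bit per haplotype allele, treats the rounded counts from the abstract as exact, and takes 4 GB as 4 × 10^9 bytes. The function and constant names are hypothetical and are not part of GTC.

```python
def naive_size_bytes(n_haplotypes: int, n_variants: int, bits_per_allele: int = 1) -> float:
    """Size of an uncompressed bit-packed haplotype matrix, in bytes."""
    return n_haplotypes * n_variants * bits_per_allele / 8


def bits_per_stored_allele(compressed_bytes: float, n_haplotypes: int, n_variants: int) -> float:
    """Average number of bits spent per haplotype allele after compression."""
    return compressed_bytes * 8 / (n_haplotypes * n_variants)


# Rounded figures quoted in the abstract (assumed exact for this estimate).
N_HAPLOTYPES = 60_000
N_VARIANTS = 40_000_000
COMPRESSED_BYTES = 4 * 10**9  # upper bound: "<4 GB", decimal gigabytes assumed

naive = naive_size_bytes(N_HAPLOTYPES, N_VARIANTS)
ratio = naive / COMPRESSED_BYTES
density = bits_per_stored_allele(COMPRESSED_BYTES, N_HAPLOTYPES, N_VARIANTS)

print(f"naive bit-packed size: {naive / 1e9:.0f} GB")
print(f"reduction vs. naive:   at least {ratio:.0f}x")
print(f"bits per allele:       at most {density:.4f}")
```

Under these assumptions the bit-packed matrix alone would take roughly 300 GB, so the reported size corresponds to well under one bit per stored allele. Real variant files carry additional per-site information, so this is a rough lower bound on the input size rather than a measured compression ratio.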
GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/gtc.
Supplementary data are available at Bioinformatics online.