Institute of Biotechnology, Cornell University.
Boyce Thompson Institute.
Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz096.
With high-throughput genotyping systems now available, it has become feasible to fully integrate genotyping information into breeding programs. Using this information effectively requires DNA extraction and marker production facilities that can deploy the desired set of markers across samples efficiently, with a turnaround time fast enough to allow selection before crosses need to be made. In practice, by the time breeders have collected all their phenotyping data and received the corresponding genotyping data, they often have only a short window in which to make decisions. This makes it a challenge to organize the information and use it in downstream analyses that support breeders' decisions. Implementing genomic selection routinely as part of breeding programs therefore requires an efficient genotyping data storage system. We selected and benchmarked six popular open-source data storage systems, including relational database management systems and columnar storage systems.
We found that data extraction times are greatly influenced by the orientation in which genotype data are stored in a system. HDF5 consistently performed best, in part because it works more efficiently with both orientations of the allele matrix.
http://gobiin1.bti.cornell.edu:6083/projects/GBM/repos/benchmarking/browse.
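The orientation effect described above can be illustrated with a toy model. The sketch below is not the benchmark code and does not use HDF5 or any of the benchmarked systems. It assumes a small markers-by-samples matrix of allele calls flattened into one contiguous buffer, and the names `store` and `extract` are hypothetical. It counts how many separate contiguous reads are needed to pull out one marker or one sample under each storage orientation.

```python
from array import array


def store(matrix, orientation):
    """Flatten a markers x samples matrix into one contiguous buffer.

    orientation="marker" keeps each marker's calls adjacent;
    orientation="sample" keeps each sample's calls adjacent.
    """
    n_markers, n_samples = len(matrix), len(matrix[0])
    if orientation == "marker":
        flat = [matrix[m][s] for m in range(n_markers) for s in range(n_samples)]
    elif orientation == "sample":
        flat = [matrix[m][s] for s in range(n_samples) for m in range(n_markers)]
    else:
        raise ValueError("orientation must be 'marker' or 'sample'")
    return {
        "buf": array("b", flat),
        "n_markers": n_markers,
        "n_samples": n_samples,
        "orientation": orientation,
    }


def extract(st, axis, index):
    """Return (calls, contiguous_reads) for one marker or one sample."""
    if axis not in ("marker", "sample"):
        raise ValueError("axis must be 'marker' or 'sample'")
    n_m, n_s, buf = st["n_markers"], st["n_samples"], st["buf"]
    # A marker has one call per sample; a sample has one call per marker.
    length = n_s if axis == "marker" else n_m
    if axis == st["orientation"]:
        # Requested axis matches storage order: one contiguous slice.
        start = index * length
        return list(buf[start:start + length]), 1
    # Requested axis cuts across storage order: one strided read per call.
    stride = n_m if axis == "marker" else n_s
    return [buf[index + k * stride] for k in range(length)], length


# Hypothetical data: 3 markers (rows) x 4 samples (columns), calls coded 0/1/2.
calls = [
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 0, 1],
]
marker_major = store(calls, "marker")
sample_major = store(calls, "sample")

for name, st in (("marker-major", marker_major), ("sample-major", sample_major)):
    _, marker_reads = extract(st, "marker", 1)
    _, sample_reads = extract(st, "sample", 2)
    print(f"{name}: marker reads={marker_reads}, sample reads={sample_reads}")
```

In this toy model, a query along the stored orientation needs a single contiguous read, while a query across it needs one read per call. With realistic matrix sizes that gap grows with the number of markers or samples, which is one plausible way to read the sensitivity to orientation reported in the abstract.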