Practical implications of using non-relational databases to store large genomic data files and novel phenotypes.

Author affiliations

Institute of Mathematics and Computer Sciences, University of Sao Paulo, Sao Carlos, Sao Paulo, Brazil.

Department of Animal Nutrition and Production, School of Veterinary Medicine and Animal Science, University of Sao Paulo, Pirassununga, Sao Paulo, Brazil.

Publication information

J Anim Breed Genet. 2022 Jan;139(1):100-112. doi: 10.1111/jbg.12644. Epub 2021 Aug 29.

Abstract

The objective of our study was to provide practical directions on storing genomic information and novel phenotypes (treated here as unstructured data) in a non-relational database. MongoDB was assessed for this purpose, with the aim of enabling frequent data transactions involving numerous individuals under genetic evaluation. Our study investigated different genomic (Illumina Final Report, PLINK, 0125, FASTQ, and VCF formats) and phenotypic (including media files) data types, using both real and simulated datasets. Advantages of our centralized database concept include sublinear growth in query running time as the number of samples/markers increases exponentially, along with comprehensive management of distinct data formats when searching for specific genomic regions. A comparison of our generic non-relational solution with an existing relational approach (developed for tabular data types and using 2 bits to store genotypes) showed that the relational schema imported 50M SNPs (PLINK format) in less time. Our experimental results also reinforce that data conversion is a costly step when loading genomic data into both relational and non-relational database systems, and it must therefore be handled carefully in large applications.
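To illustrate why 2 bits per genotype can make a tabular store compact, the following is a minimal sketch of packing four genotype calls into one byte. It is not the encoding used by the relational approach or by the authors. The bit assignments, the function names `pack_genotypes` and `unpack_genotypes`, and the reading of `5` as a missing call in the 0125 coding are all assumptions made for illustration.

```python
# Hypothetical sketch: pack genotype calls coded 0/1/2/5 into 2 bits each.
# Assumption: 5 denotes a missing call in the "0125" coding.
# The bit patterns below are arbitrary and chosen only for this example.

CODE_TO_BITS = {0: 0b00, 1: 0b01, 2: 0b10, 5: 0b11}
BITS_TO_CODE = {bits: code for code, bits in CODE_TO_BITS.items()}


def pack_genotypes(calls):
    """Pack a sequence of 0/1/2/5 calls into bytes, four calls per byte."""
    packed = bytearray((len(calls) + 3) // 4)
    for i, call in enumerate(calls):
        # Each call occupies 2 bits; the first call sits in the lowest bits.
        packed[i // 4] |= CODE_TO_BITS[call] << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n_calls):
    """Recover the first n_calls genotype calls from packed bytes."""
    return [
        BITS_TO_CODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(n_calls)
    ]


calls = [0, 1, 2, 5, 2, 2, 0]
packed = pack_genotypes(calls)
restored = unpack_genotypes(packed, len(calls))
```

Under this scheme, seven calls fit in two bytes, whereas a one-character-per-call text representation would need seven. The number of calls has to be stored separately, because the padding bits in the final byte are indistinguishable from real `0` calls.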
