Practical implications of using non-relational databases to store large genomic data files and novel phenotypes.

Affiliations

Institute of Mathematics and Computer Sciences, University of Sao Paulo, Sao Carlos, Sao Paulo, Brazil.

Department of Animal Nutrition and Production, School of Veterinary Medicine and Animal Science, University of Sao Paulo, Pirassununga, Sao Paulo, Brazil.

Publication information

J Anim Breed Genet. 2022 Jan;139(1):100-112. doi: 10.1111/jbg.12644. Epub 2021 Aug 29.

DOI:10.1111/jbg.12644

PMID:34459042

Abstract

The objective of our study was to provide practical directions on the storage of genomic information and novel phenotypes (treated here as unstructured data) using a non-relational database. The MongoDB technology was assessed for this purpose, enabling frequent data transactions involving numerous individuals under genetic evaluation. Our study investigated different genomic (Illumina Final Report, PLINK, 0125, FASTQ, and VCF formats) and phenotypic (including media files) information, using both real and simulated datasets. Advantages of our centralized database concept include the sublinear running time for queries after increasing the number of samples/markers exponentially, in addition to the comprehensive management of distinct data formats while searching for specific genomic regions. A comparison of our non-relational and generic solution, with an existing relational approach (developed for tabular data types using 2 bits to store genotypes), showed reduced importing time to handle 50M SNPs (PLINK format) achieved by the relational schema. Our experimental results also reinforce that data conversion is a costly step required to manage genomic data into both relational and non-relational database systems, and therefore, must be carefully treated for large applications.
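The relational approach mentioned above is described as using 2 bits to store each genotype. The following is a minimal sketch of that general idea, not the encoding of the compared system or of any particular tool. The function names, the genotype codes 0/1/2, and the use of 3 for a missing call are all assumptions made for illustration.

```python
from typing import List

# Assumed coding for this sketch: 0, 1, 2 = allele counts, 3 = missing call.
MISSING = 3


def pack_genotypes(genotypes: List[int]) -> bytes:
    """Pack genotype codes into 2 bits each, four genotypes per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2, MISSING):
            raise ValueError(f"unsupported genotype code: {g}")
        # Genotype i occupies bits 2*(i % 4) .. 2*(i % 4) + 1 of byte i // 4.
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed: bytes, n: int) -> List[int]:
    """Recover the first n genotype codes from a packed byte string."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]
```

Under these assumptions, a sample genotyped at 50M SNPs would occupy roughly 12.5 MB in packed form, compared with one byte or more per genotype in a plain-text representation.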

Similar articles

Practical implications of using non-relational databases to store large genomic data files and novel phenotypes.

J Anim Breed Genet. 2022 Jan;139(1):100-112. doi: 10.1111/jbg.12644. Epub 2021 Aug 29.

Evaluation of relational and NoSQL database architectures to manage genomic annotations.

J Biomed Inform. 2016 Dec;64:288-295. doi: 10.1016/j.jbi.2016.10.015. Epub 2016 Oct 31.

High density genotype storage for plant breeding in the Chado schema of Breedbase.

PLoS One. 2020 Nov 11;15(11):e0240059. doi: 10.1371/journal.pone.0240059. eCollection 2020.

Executing Complexity-Increasing Queries in Relational (MySQL) and NoSQL (MongoDB and EXist) Size-Growing ISO/EN 13606 Standardized EHR Databases.

J Vis Exp. 2018 Mar 19(133):57439. doi: 10.3791/57439.

Rapid storage and retrieval of genomic intervals from a relational database system using nested containment lists.

Database (Oxford). 2013 Jul 26;2013:bat056. doi: 10.1093/database/bat056. Print 2013.

FASTdoop: a versatile and efficient library for the input of FASTA and FASTQ files for MapReduce Hadoop bioinformatics applications.

Bioinformatics. 2017 May 15;33(10):1575-1577. doi: 10.1093/bioinformatics/btx010.

Dynamic tables: an architecture for managing evolving, heterogeneous biomedical data in relational database management systems.

J Am Med Inform Assoc. 2007 Jan-Feb;14(1):86-93. doi: 10.1197/jamia.M2189. Epub 2006 Oct 26.

SNPLims: a data management system for genome wide association studies.

BMC Bioinformatics. 2008 Mar 26;9 Suppl 2(Suppl 2):S13. doi: 10.1186/1471-2105-9-S2-S13.

An adaptive spark-based framework for querying large-scale NoSQL and relational databases.

PLoS One. 2021 Aug 19;16(8):e0255562. doi: 10.1371/journal.pone.0255562. eCollection 2021.

Managing large SNP datasets with SNPpy.

Methods Mol Biol. 2013;1019:99-127. doi: 10.1007/978-1-62703-447-0_4.

Cited by

Applications of livestock monitoring devices and machine learning algorithms in animal production and reproduction: an overview.

Anim Reprod. 2023 Aug 28;20(2):e20230077. doi: 10.1590/1984-3143-AR2023-0077. eCollection 2023.
