Petabyte-scale innovations at the European Nucleotide Archive.

Author information

Cochrane Guy, Akhtar Ruth, Bonfield James, Bower Lawrence, Demiralp Fehmi, Faruque Nadeem, Gibson Richard, Hoad Gemma, Hubbard Tim, Hunter Christopher, Jang Mikyung, Juhos Szilveszter, Leinonen Rasko, Leonard Steven, Lin Quan, Lopez Rodrigo, Lorenc Dariusz, McWilliam Hamish, Mukherjee Gaurab, Plaister Sheila, Radhakrishnan Rajesh, Robinson Stephen, Sobhany Siamak, Hoopen Petra Ten, Vaughan Robert, Zalunin Vadim, Birney Ewan

Affiliations

EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

Publication information

Nucleic Acids Res. 2009 Jan;37(Database issue):D19-25. doi: 10.1093/nar/gkn765. Epub 2008 Oct 31.

DOI:10.1093/nar/gkn765

PMID:18978013

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2686451/

Abstract

Dramatic increases in the throughput of nucleotide sequencing machines, and the promise of ever greater performance, have thrust bioinformatics into the era of petabyte-scale data sets. Sequence repositories, which provide the feed for these data sets into the worldwide computational infrastructure, are challenged by the impact of these data volumes. The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/embl), comprising the EMBL Nucleotide Sequence Database and the Ensembl Trace Archive, has identified challenges in the storage, movement, analysis, interpretation and visualization of petabyte-scale data sets. We present here our new repository for next generation sequence data, a brief summary of contents of the ENA and provide details of major developments to submission pipelines, high-throughput rule-based validation infrastructure and data integration approaches.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c14b/2686451/3a0786a1c882/gkn765f1.jpg

Similar articles

Priorities for nucleotide trace, sequence and annotation data capture at the Ensembl Trace Archive and the EMBL Nucleotide Sequence Database.

Nucleic Acids Res. 2008 Jan;36(Database issue):D5-12. doi: 10.1093/nar/gkm1018. Epub 2007 Nov 26.

The European Nucleotide Archive.

Nucleic Acids Res. 2011 Jan;39(Database issue):D28-31. doi: 10.1093/nar/gkq967. Epub 2010 Oct 23.

The EMBL Nucleotide Sequence Database.

Nucleic Acids Res. 2002 Jan 1;30(1):21-6. doi: 10.1093/nar/30.1.21.

Major submissions tool developments at the European Nucleotide Archive.

Nucleic Acids Res. 2012 Jan;40(Database issue):D43-7. doi: 10.1093/nar/gkr946. Epub 2011 Nov 12.

Content discovery and retrieval services at the European Nucleotide Archive.

Nucleic Acids Res. 2015 Jan;43(Database issue):D23-9. doi: 10.1093/nar/gku1129. Epub 2014 Nov 17.

The European Nucleotide Archive in 2020.

Nucleic Acids Res. 2021 Jan 8;49(D1):D82-D85. doi: 10.1093/nar/gkaa1028.

Improvements to services at the European Nucleotide Archive.

Nucleic Acids Res. 2010 Jan;38(Database issue):D39-45. doi: 10.1093/nar/gkp998. Epub 2009 Nov 11.

The European Nucleotide Archive in 2021.

Nucleic Acids Res. 2022 Jan 7;50(D1):D106-D110. doi: 10.1093/nar/gkab1051.

The European Nucleotide Archive in 2023.

Nucleic Acids Res. 2024 Jan 5;52(D1):D92-D97. doi: 10.1093/nar/gkad1067.

Cited by

Chromosome-scale genome assembly and annotation of two geographically distinct strains of malaria vector Anopheles albimanus.

Sci Rep. 2025 Jun 3;15(1):19448. doi: 10.1038/s41598-025-01713-9.

Transforming Cardiovascular Care With Artificial Intelligence: From Discovery to Practice: JACC State-of-the-Art Review.

J Am Coll Cardiol. 2024 Jul 2;84(1):97-114. doi: 10.1016/j.jacc.2024.05.003.

UTexas Aptamer Database: the collection and long-term preservation of aptamer sequence information.

Nucleic Acids Res. 2024 Jan 5;52(D1):D351-D359. doi: 10.1093/nar/gkad959.

Scientific Discovery Games for Biomedical Research.

Annu Rev Biomed Data Sci. 2019 Jul;2(1):253-279. doi: 10.1146/annurev-biodatasci-072018-021139.

Fungal metabarcoding data integration framework for the MycoDiversity DataBase (MDDB).

J Integr Bioinform. 2020 May 28;17(1):20190046. doi: 10.1515/jib-2019-0046.

Converting DNA and chemical fingerprints into two-dimensional barcode.

J Ginseng Res. 2017 Jul;41(3):339-346. doi: 10.1016/j.jgr.2016.06.006. Epub 2016 Jul 21.

Reconstructing 16S rRNA genes in metagenomic data.

Bioinformatics. 2015 Jun 15;31(12):i35-43. doi: 10.1093/bioinformatics/btv231.

ArrayExpress update--simplifying data submissions.

Nucleic Acids Res. 2015 Jan;43(Database issue):D1113-6. doi: 10.1093/nar/gku1057. Epub 2014 Oct 31.

Integrating pathways of Parkinson's disease in a molecular interaction map.

Mol Neurobiol. 2014 Feb;49(1):88-102. doi: 10.1007/s12035-013-8489-4. Epub 2013 Jul 7.

Building models using Reactome pathways as templates.

Methods Mol Biol. 2013;1021:273-83. doi: 10.1007/978-1-62703-450-0_14.
