Computational solutions to large-scale data management and analysis.

Affiliation

Pacific Biosciences, Menlo Park, California 94025, USA.

Publication information

Nat Rev Genet. 2010 Sep;11(9):647-57. doi: 10.1038/nrg2857.

DOI:10.1038/nrg2857

PMID:20717155

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3124937/

Abstract

Today we can generate hundreds of gigabases of DNA and RNA sequencing data in a week for less than US$5,000. The astonishing rate of data generation by these low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry. Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that are generated by these technologies, which in turn requires us to adopt advances in informatics. Here we discuss how we can master the different types of computational environments that exist, such as cloud and heterogeneous computing, to successfully tackle our big data problems.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ac8f/3124937/c643401f624f/nihms304947f1.jpg
