Scalability and Validation of Big Data Bioinformatics Software

Authors

Yang Andrian, Troup Michael, Ho Joshua W K

Affiliations

Victor Chang Cardiac Research Institute, Darlinghurst, NSW 2010, Australia.

St. Vincent's Clinical School, University of New South Wales, Darlinghurst, NSW 2010, Australia.

Publication information

Comput Struct Biotechnol J. 2017 Jul 20;15:379-386. doi: 10.1016/j.csbj.2017.07.002. eCollection 2017.

DOI:10.1016/j.csbj.2017.07.002

PMID:28794828

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5537105/

Abstract

This review examines two aspects that are central to modern big data bioinformatics analysis: software scalability and validity. We argue that the issues of scalability and validation are not only common to all big data bioinformatics analyses but can also be tackled by conceptually related methodological approaches, namely divide-and-conquer (scalability) and multiple executions (validation). Scalability is defined as the ability of a program to scale with its workload. It has always been an important consideration when developing bioinformatics algorithms and programs. Nonetheless, the surge in the volume and variety of biological and biomedical data has posed new challenges. We discuss how modern cloud computing and big data programming frameworks such as MapReduce and Spark are being used to implement divide-and-conquer effectively in a distributed computing environment. Validation of software is another important issue in big data bioinformatics that is often ignored. Software validation is the process of determining whether the program under test fulfils the task for which it was designed. Determining the correctness of the computational output of big data bioinformatics software is especially difficult because of the large input space and the complex algorithms involved. We discuss how state-of-the-art software testing techniques that are based on the idea of multiple executions, such as metamorphic testing, can be used to implement an effective bioinformatics quality assurance strategy. We hope this review will raise awareness of these critical issues in bioinformatics.
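
The two ideas named in the abstract can be illustrated together with a toy example. The sketch below is not taken from the review: it assumes k-mer counting as the task, runs everything sequentially in one process instead of on MapReduce or Spark workers, and all function names are hypothetical. It splits the reads into chunks, counts per chunk, and merges the partial counts (divide-and-conquer), then runs the program several times on transformed inputs and checks that the outputs relate as expected (metamorphic testing).

```python
from collections import Counter
from functools import reduce
import random


def count_kmers(reads, k):
    """Map step: count k-mers in one chunk of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_into_chunks(reads, n_chunks):
    """Divide step: partition the reads into n_chunks roughly equal parts."""
    return [reads[i::n_chunks] for i in range(n_chunks)]


def distributed_kmer_count(reads, k, n_chunks):
    """Count per chunk (map), then merge the partial counts (reduce).

    Assumption: in a real deployment each chunk would be processed by a
    separate worker; here the chunks are processed one after another.
    """
    partial = [count_kmers(chunk, k) for chunk in split_into_chunks(reads, n_chunks)]
    return reduce(lambda a, b: a + b, partial, Counter())


def check_metamorphic_relations(reads, k, seed=0):
    """Run the program several times and compare outputs across runs."""
    baseline = distributed_kmer_count(reads, k, n_chunks=1)
    # Relation 1: the number of partitions must not change the counts.
    repartitioned = distributed_kmer_count(reads, k, n_chunks=4)
    # Relation 2: the order of the reads must not change the counts.
    shuffled = reads[:]
    random.Random(seed).shuffle(shuffled)
    reordered = distributed_kmer_count(shuffled, k, n_chunks=3)
    # Relation 3: duplicating the input must double every count.
    doubled = distributed_kmer_count(reads + reads, k, n_chunks=2)
    expected_doubled = Counter({kmer: 2 * n for kmer, n in baseline.items()})
    return (
        baseline == repartitioned
        and baseline == reordered
        and doubled == expected_doubled
    )


reads = ["ACGTAC", "GTACGT", "TTACGA", "ACG"]
print(distributed_kmer_count(reads, k=3, n_chunks=2)["ACG"])
print(check_metamorphic_relations(reads, k=3))
```

None of the three relations requires knowing the correct k-mer counts in advance, which is why this style of testing suits programs whose expected output is hard to obtain independently.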


Figure image: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f880/5537105/7046801c6c37/gr1.jpg

Similar articles

How to test bioinformatics software?

Biophys Rev. 2015 Sep;7(3):343-352. doi: 10.1007/s12551-015-0177-3. Epub 2015 Aug 13.

HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

J Biomed Inform. 2015 Apr;54:58-64. doi: 10.1016/j.jbi.2015.01.008. Epub 2015 Jan 24.

Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics.

BMC Bioinformatics. 2019 Apr 18;20(Suppl 4):138. doi: 10.1186/s12859-019-2694-8.

A distributed computing model for big data anonymization in the networks.

PLoS One. 2023 Apr 28;18(4):e0285212. doi: 10.1371/journal.pone.0285212. eCollection 2023.

Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends.

BioData Min. 2014 Oct 29;7:22. doi: 10.1186/1756-0381-7-22. eCollection 2014.

An innovative approach for testing bioinformatics programs using metamorphic testing.

BMC Bioinformatics. 2009 Jan 19;10:24. doi: 10.1186/1471-2105-10-24.

MaRe: Processing Big Data with application containers on Apache Spark.

Gigascience. 2020 May 1;9(5). doi: 10.1093/gigascience/giaa042.

Towards testing big data analytics software: the essential role of metamorphic testing.

Biophys Rev. 2019 Feb;11(1):123-125. doi: 10.1007/s12551-018-0492-6. Epub 2018 Dec 18.

Computational Strategies for Scalable Genomics Analysis.

Genes (Basel). 2019 Dec 6;10(12):1017. doi: 10.3390/genes10121017.

Cited by

Formal verification of bioinformatics software using model checking and theorem proving.

Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf383.

Intronomics-MIP: a snakemake pipeline for analyzing multilocus intron polymorphisms in species identification and population genomics.

BMC Res Notes. 2025 May 6;18(1):203. doi: 10.1186/s13104-025-07264-6.

Galaxy as a gateway to bioinformatics: Multi-Interface Galaxy Hands-on Training Suite (MIGHTS) for scRNA-seq.

Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giae107.

Mpox Discourse on Twitter by Sexual Minority Men and Gender-Diverse Individuals: Infodemiological Study Using BERTopic.

JMIR Public Health Surveill. 2024 Aug 13;10:e59193. doi: 10.2196/59193.

To metabolomics and beyond: a technological portfolio to investigate cancer metabolism.

Signal Transduct Target Ther. 2023 Mar 22;8(1):137. doi: 10.1038/s41392-023-01380-0.

Developing a real-world database for oncology: a descriptive analysis of breast cancer in Argentina.

Ecancermedicalscience. 2022 Aug 4;16:1435. doi: 10.3332/ecancer.2022.1435. eCollection 2022.

Assessing and assuring interoperability of a genomics file format.

Bioinformatics. 2022 Jun 27;38(13):3327-3336. doi: 10.1093/bioinformatics/btac327.

Improving bioinformatics software quality through incorporation of software engineering practices.

PeerJ Comput Sci. 2022 Jan 5;8:e839. doi: 10.7717/peerj-cs.839. eCollection 2022.

Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers.

Nat Methods. 2021 Oct;18(10):1161-1168. doi: 10.1038/s41592-021-01254-9. Epub 2021 Sep 23.

Performance and scaling behavior of bioinformatic applications in virtualization environments to create awareness for the efficient use of compute resources.

PLoS Comput Biol. 2021 Jul 20;17(7):e1009244. doi: 10.1371/journal.pcbi.1009244. eCollection 2021 Jul.

