Yang Andrian, Troup Michael, Ho Joshua W K
Victor Chang Cardiac Research Institute, Darlinghurst, NSW 2010, Australia.
St. Vincent's Clinical School, University of New South Wales, Darlinghurst, NSW 2010, Australia.
Comput Struct Biotechnol J. 2017 Jul 20;15:379-386. doi: 10.1016/j.csbj.2017.07.002. eCollection 2017.
This review examines two aspects that are central to modern big data bioinformatics analysis: software scalability and validity. We argue that the issues of scalability and validation are not only common to all big data bioinformatics analyses but can also be tackled by conceptually related methodological approaches, namely divide-and-conquer (scalability) and multiple executions (validation). Scalability is defined as the ability of a program to scale with its workload. It has always been an important consideration when developing bioinformatics algorithms and programs. Nonetheless, the surge in the volume and variety of biological and biomedical data has posed new challenges. We discuss how modern cloud computing and big data programming frameworks such as MapReduce and Spark are being used to effectively implement divide-and-conquer in a distributed computing environment. Validation of software is another important issue in big data bioinformatics that is often ignored. Software validation is the process of determining whether the program under test fulfils the task for which it was designed. Determining the correctness of the computational output of big data bioinformatics software is especially difficult due to the large input space and complex algorithms involved. We discuss how state-of-the-art software testing techniques that are based on the idea of multiple executions, such as metamorphic testing, can be used to implement an effective bioinformatics quality assurance strategy. We hope this review will raise awareness of these critical issues in bioinformatics.
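The two ideas named in the abstract, divide-and-conquer for scalability and multiple executions for validation, can be illustrated with a small sketch. The example below is not taken from the review: it assumes a toy k-mer counting task, runs in a single Python process rather than on MapReduce or Spark, and all function names (`count_kmers`, `chunked`, `count_kmers_divide_and_conquer`, `metamorphic_checks`) are hypothetical. The input is split into chunks, each chunk is counted independently (the "map" step), and the partial counts are merged (the "reduce" step). The program is then run several times on related inputs to check two metamorphic relations that should hold without knowing the true output: reordering the reads must not change the counts, and duplicating the input must double every count.

```python
from collections import Counter
from functools import reduce
import random


def count_kmers(reads, k):
    """Map step: count every k-mer in one chunk of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def chunked(items, n_chunks):
    """Split a list into at most n_chunks contiguous parts of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_kmers_divide_and_conquer(reads, k, n_chunks=4):
    """Count each chunk independently, then merge the partial counts.

    In a distributed framework each chunk would be handled by a separate
    worker; here the chunks are processed sequentially for illustration.
    """
    partials = [count_kmers(chunk, k) for chunk in chunked(reads, n_chunks)]
    return reduce(lambda a, b: a + b, partials, Counter())  # reduce step


def metamorphic_checks(reads, k, seed=0):
    """Run the program on related inputs and compare the outputs.

    Returns (permutation_ok, duplication_ok).
    """
    baseline = count_kmers_divide_and_conquer(reads, k)

    # Relation 1: the order of the reads must not affect the counts.
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    permutation_ok = count_kmers_divide_and_conquer(shuffled, k) == baseline

    # Relation 2: feeding the input twice must double every count.
    doubled = count_kmers_divide_and_conquer(reads + reads, k)
    expected = Counter({kmer: 2 * n for kmer, n in baseline.items()})
    duplication_ok = doubled == expected

    return permutation_ok, duplication_ok


if __name__ == "__main__":
    example_reads = ["ACGTAC", "GTACGT", "TTACG"]
    print(count_kmers_divide_and_conquer(example_reads, k=3))
    print(metamorphic_checks(example_reads, k=3))
```

Neither relation proves the program correct, but a violation of either one reveals a fault without requiring a known-correct reference output, which is the appeal of this kind of testing when the expected result is hard to determine.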