Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany.
Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany.
Mol Biol Evol. 2018 May 1;35(5):1037-1046. doi: 10.1093/molbev/msy014.
With Next Generation Sequencing data being routinely used, evolutionary biology is transforming into a computational science. Thus, researchers have to rely on a growing number of increasingly complex software tools. All widely used core tools in the field have grown considerably in terms of the number of features and lines of code and, consequently, also with respect to software complexity. A topic that has received little attention is the software engineering quality of these widely used core analysis tools. Software developers appear to rarely assess the quality of their code, and this can have negative consequences for end-users. To this end, we assessed the code quality of 16 highly cited and compute-intensive tools mainly written in C/C++ (e.g., MrBayes, MAFFT, SweepFinder) and Java (BEAST) from the broader area of evolutionary biology that are routinely used in current data analysis pipelines. Because the software engineering quality of the tools we analyzed is rather unsatisfactory, we provide a list of best practices for improving the quality of existing tools and list techniques that can be deployed for developing reliable, high-quality scientific software from scratch. Finally, we also discuss journal and science policy and, more importantly, funding issues that need to be addressed to improve software engineering quality and to ensure support for developing new and maintaining existing software. Our intention is to raise the awareness of the community regarding software engineering quality issues and to emphasize the substantial lack of funding for scientific software development.