Department of Biochemistry, University of Otago, Dunedin, New Zealand.
Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand.
Genome Biol. 2022 Feb 16;23(1):56. doi: 10.1186/s13059-022-02625-x.
Computational biology provides software tools for testing and making inferences about biological data. In the face of increasing volumes of data, heuristic methods that trade accuracy for software speed may be employed. We have studied these trade-offs using the results of a large number of independent software benchmarks, and evaluated whether external factors, including speed, author reputation, journal impact, recency and developer efforts, are indicative of accurate software.
We find that software speed, author reputation, journal impact, number of citations and age are unreliable predictors of software accuracy. This is unfortunate because these are frequently cited reasons for selecting software tools. However, GitHub-derived statistics and high version numbers show that accurate bioinformatic software tools are generally the product of many improvements over time. We also find an excess of slow and inaccurate bioinformatic software tools, and this is consistent across many sub-disciplines. There are few tools that are middle-of-the-road in terms of accuracy and speed trade-offs.
Our findings indicate that accurate bioinformatic software is primarily the product of long-term commitments to software development. In addition, we hypothesise that bioinformatics software suffers from publication bias. Software that is intermediate in terms of both speed and accuracy may be difficult to publish, possibly due to author, editor and reviewer practices. This leaves an unfortunate hole in the literature, as ideal tools may fall into this gap. High-accuracy tools are not always useful if they are slow, while high speed is not useful if the results are also inaccurate.
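One way to ask whether an external factor tracks accuracy is a rank correlation between that factor and benchmark accuracy ranks. The sketch below is illustrative only: the abstract does not state which statistic was used, so Spearman's rho is an assumption here, and the tool names and values are invented placeholders rather than data from the study.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical benchmark table (invented values, NOT results from the paper):
# accuracy score, commit count and citation count for five made-up tools.
tools = {
    "toolA": {"accuracy": 0.91, "commits": 840, "citations": 120},
    "toolB": {"accuracy": 0.78, "commits": 95, "citations": 2300},
    "toolC": {"accuracy": 0.85, "commits": 410, "citations": 45},
    "toolD": {"accuracy": 0.62, "commits": 12, "citations": 610},
    "toolE": {"accuracy": 0.88, "commits": 530, "citations": 980},
}

accuracy = [t["accuracy"] for t in tools.values()]
for factor in ("commits", "citations"):
    rho = spearman([t[factor] for t in tools.values()], accuracy)
    print(f"{factor}: rho = {rho:.2f}")
```

A rho near +1 or -1 would suggest the factor orders tools much as accuracy does, while a value near 0 would suggest it is an unreliable guide. With only a handful of tools the estimate is very noisy, so a real analysis would need many benchmarks and a significance test.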