Department of Creative Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, 113-0033, Japan.
Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, 113-0033, Japan.
Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad031. Epub 2023 May 8.
The reproducibility of data analysis workflows is a key issue in bioinformatics. Recent computing technologies, such as virtualization, have made it possible to reproduce workflow execution with ease. However, the reproducibility of the results themselves has received little discussion; that is, there is no standard way to verify whether the biological interpretation of reproduced results is the same as that of the original. Automatically evaluating the reproducibility of results therefore remains a challenge.
We propose a new metric, a reproducibility scale of workflow execution results, to evaluate the reproducibility of results. The metric compares biological feature values (e.g., number of reads, mapping rate, and variant frequency) that represent the biological interpretation of the results. We also implemented a prototype system that automatically evaluates the reproducibility of results using the proposed metric. To demonstrate our approach, we conducted an experiment using workflows from real research projects and use cases frequently encountered in bioinformatics.
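As a rough illustration of comparing feature values rather than output files, the following minimal sketch grades two runs on a three-level scale. The function name `compare_features`, the level names, and the 5% relative tolerance are assumptions made for this example only; they are not the scale or the prototype described above.

```python
from math import isclose


def compare_features(original, reproduced, rel_tol=0.05):
    """Grade how closely two runs agree, given dicts of feature values.

    Hypothetical levels (not the published scale):
      "identical" - every feature value matches exactly
      "similar"   - same features, all values within rel_tol
      "different" - a feature is missing or a value is outside rel_tol
    """
    if original == reproduced:
        return "identical"
    if original.keys() != reproduced.keys():
        return "different"
    within_tolerance = all(
        isclose(original[name], reproduced[name], rel_tol=rel_tol)
        for name in original
    )
    return "similar" if within_tolerance else "different"


# Illustrative feature values only; these numbers are made up.
original_run = {"num_reads": 1_000_000, "mapping_rate": 0.952, "variant_frequency": 0.31}
rerun_close = {"num_reads": 1_000_000, "mapping_rate": 0.950, "variant_frequency": 0.31}
rerun_far = {"num_reads": 1_000_000, "mapping_rate": 0.700, "variant_frequency": 0.31}

print(compare_features(original_run, dict(original_run)))
print(compare_features(original_run, rerun_close))
print(compare_features(original_run, rerun_far))
```

A single tolerance for all features is a simplification; in practice each feature would likely need its own threshold, chosen with the biological question in mind.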
Our approach enables automatic evaluation of the reproducibility of results on a fine-grained scale. It makes it possible to move from a binary view of whether results are superficially identical to a more graduated view. We believe that our approach will contribute to more informed discussion of reproducibility in bioinformatics.