Assessing and assuring interoperability of a genomics file format.

Author affiliations

Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 1L7, Canada.

Department of Medical Biophysics, University of Toronto, Toronto, ON M5G 1L7, Canada.

Publication information

Bioinformatics. 2022 Jun 27;38(13):3327-3336. doi: 10.1093/bioinformatics/btac327.

Abstract

MOTIVATION

Bioinformatics software tools operate largely through the use of specialized genomics file formats. Often these formats lack formal specification, making it difficult or impossible for the creators of these tools to robustly test them for correct handling of input and output. This causes interoperability problems between different tools that, at best, waste time and frustrate users. At worst, interoperability issues could lead to undetected errors in scientific results.

RESULTS

We developed a new verification system, Acidbio, which tests for correct behavior in bioinformatics software packages. We crafted tests to unify correct behavior when tools encounter various edge cases, that is, potentially unexpected inputs that exemplify the limits of the format. To analyze the performance of existing software, we tested the input validation of 80 Bioconda packages that parsed the Browser Extensible Data (BED) format. We also used a fuzzing approach to automatically perform additional testing. Of the 80 software packages examined, 75 achieved less than 70% correctness on our test suite. We categorized multiple root causes for the poor performance of different types of software. Fuzzing detected other errors that the manually designed test suite could not. We also created a badge system that developers can use to indicate more precisely which BED variants their software accepts and to advertise the software's performance on the test suite.
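To make the idea of scoring a parser against edge cases concrete, here is a minimal sketch in Python. It is not Acidbio's code or test suite: the five cases, the simplified three-column rules, and the names `is_valid_bed3`, `CASES`, `correctness`, and `lenient_parser` are all assumptions made for illustration. It also assumes that a tool passes a case when it accepts a valid input and rejects an invalid one.

```python
import re
from typing import Callable, List, Tuple

# Simplified rule (an assumption): coordinates are non-negative ASCII integers.
_UINT = re.compile(r"[0-9]+")


def is_valid_bed3(line: str) -> bool:
    """Check one line against a simplified three-column BED rule set."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return False
    chrom, start, end = fields
    if not chrom:
        return False
    if not (_UINT.fullmatch(start) and _UINT.fullmatch(end)):
        return False
    # Assumed rule: the start coordinate must not exceed the end coordinate.
    return int(start) <= int(end)


# Hypothetical edge cases: (input line, whether a parser should accept it).
CASES: List[Tuple[str, bool]] = [
    ("chr1\t0\t100", True),     # ordinary interval
    ("chr1\t100\t0", False),    # start after end
    ("chr1\t-5\t100", False),   # negative coordinate
    ("chr1\t0", False),         # missing field
    ("chr1\t1.5\t100", False),  # non-integer coordinate
]


def correctness(parser_accepts: Callable[[str], bool]) -> float:
    """Fraction of cases where the parser's accept/reject decision is right."""
    passed = sum(parser_accepts(line) == expected for line, expected in CASES)
    return passed / len(CASES)


def lenient_parser(line: str) -> bool:
    """A deliberately permissive stand-in that only counts columns."""
    return len(line.split("\t")) >= 3


if __name__ == "__main__":
    print(correctness(is_valid_bed3))
    print(correctness(lenient_parser))
```

In this toy setup the strict checker gets all five cases right, while the column-counting parser gets only two of five right, because it accepts the reversed, negative, and non-integer coordinates. A real harness would run each package's command-line interface on test files rather than call a Python function.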

AVAILABILITY AND IMPLEMENTATION

Acidbio is available at https://github.com/hoffmangroup/acidbio.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
