FastaValidator: an open-source Java library to parse and validate FASTA formatted sequences.

Authors

Waldmann Jost, Gerken Jan, Hankeln Wolfgang, Schweer Timmy, Glöckner Frank Oliver

Affiliation

Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Celsiusstrasse 1, 28359 Bremen, Germany.

Publication

BMC Res Notes. 2014 Jun 14;7:365. doi: 10.1186/1756-0500-7-365.

DOI: 10.1186/1756-0500-7-365

PMID: 24929426

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4094456/

Abstract

BACKGROUND

Advances in sequencing technologies challenge the efficient import and validation of FASTA formatted sequence data, which is still a prerequisite for most bioinformatic tools and pipelines. A comparative analysis of commonly used Bio*-frameworks (BioPerl, BioJava and Biopython) shows that their scalability and accuracy are limited.

FINDINGS

FastaValidator is a platform-independent, standardized, light-weight software library written in Java. It targets computer scientists and bioinformaticians writing software that needs to parse large amounts of sequence data quickly and accurately. For end-users, FastaValidator includes an interactive, out-of-the-box validation of FASTA formatted files, as well as a non-interactive mode designed for high-throughput validation in software pipelines.

CONCLUSIONS

The accuracy and performance of the FastaValidator library qualify it for large data sets such as those commonly produced by massively parallel (NGS) technologies. It offers scientists a fast, accurate and standardized method for parsing and validating FASTA formatted sequence data.
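
To make the idea of streaming FASTA validation concrete, here is a minimal Java sketch. It is not FastaValidator's API or algorithm. The class name `FastaCheckSketch`, its methods, and the validation rules are assumptions chosen for illustration: every record needs a non-empty `>` header and at least one sequence line, sequence symbols are limited to letters plus `*` and `-`, and blank lines are tolerated. The input is read line by line, so memory use does not grow with file size.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of the FastaValidator library.
final class FastaCheckSketch {

    private FastaCheckSketch() {}

    /** Returns null if the input passes the checks, else a message for the first problem. */
    static String firstError(BufferedReader in) throws IOException {
        boolean seenHeader = false;        // has any '>' line been read yet?
        boolean recordHasSequence = false; // does the current record have sequence data?
        int lineNo = 0;
        String line;
        while ((line = in.readLine()) != null) {
            lineNo++;
            if (line.isEmpty()) {
                continue; // assumption: blank lines are ignored rather than rejected
            }
            if (line.charAt(0) == '>') {
                if (seenHeader && !recordHasSequence) {
                    return "line " + lineNo + ": previous record has no sequence";
                }
                if (line.trim().length() == 1) {
                    return "line " + lineNo + ": header has no identifier";
                }
                seenHeader = true;
                recordHasSequence = false;
            } else {
                if (!seenHeader) {
                    return "line " + lineNo + ": sequence data before the first '>' header";
                }
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    // assumption: letters cover nucleotide and amino acid codes
                    boolean ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                            || c == '*' || c == '-';
                    if (!ok) {
                        return "line " + lineNo + ", column " + (i + 1)
                                + ": invalid symbol '" + c + "'";
                    }
                }
                recordHasSequence = true;
            }
        }
        if (!seenHeader) {
            return "no FASTA record found";
        }
        if (!recordHasSequence) {
            return "line " + lineNo + ": last record has no sequence";
        }
        return null;
    }

    /** Convenience wrapper for checking text that is already in memory. */
    static String firstErrorInText(String text) {
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            return firstError(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Non-interactive use: pipe a FASTA file to standard input. */
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String error = firstError(in);
            System.out.println(error == null ? "valid" : "invalid: " + error);
        }
    }
}
```

Stopping at the first error keeps the sketch short. A pipeline that needs a full report could collect every message instead of returning early.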
