Waldmann Jost, Gerken Jan, Hankeln Wolfgang, Schweer Timmy, Glöckner Frank Oliver
Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Celsiusstrasse 1, 28359 Bremen, Germany.
BMC Res Notes. 2014 Jun 14;7:365. doi: 10.1186/1756-0500-7-365.
Advances in sequencing technologies make it harder to import and validate FASTA-formatted sequence data efficiently, yet such data remain a prerequisite for most bioinformatic tools and pipelines. A comparative analysis of the commonly used Bio* frameworks (BioPerl, BioJava and Biopython) shows that their scalability and accuracy are limited.
FastaValidator is a platform-independent, standardized, lightweight software library written in the Java programming language. It targets computer scientists and bioinformaticians writing software that needs to parse large amounts of sequence data quickly and accurately. For end users, FastaValidator includes an interactive, out-of-the-box validation mode for FASTA-formatted files, as well as a non-interactive mode designed for high-throughput validation in software pipelines.
The accuracy and performance of the FastaValidator library qualify it for large data sets such as those commonly produced by massively parallel sequencing (NGS) technologies. It offers scientists a fast, accurate and standardized method for parsing and validating FASTA-formatted sequence data.
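To illustrate the kind of check involved, the following is a minimal sketch of a streaming FASTA validator in Java. It is not the FastaValidator API: the class name `FastaSketch`, the method `firstInvalidLine`, and the accepted alphabet (letters plus `-` and `*`) are assumptions made for this example. It reads one line at a time, so memory use does not grow with file size, and it reports the first offending line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical streaming FASTA checker; an illustration, not the FastaValidator API. */
class FastaSketch {

    /** Returns -1 if the input is valid, otherwise the 1-based line number of the first problem. */
    static long firstInvalidLine(Reader input) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        long lineNo = 0;
        boolean seenHeader = false;        // at least one '>' line was read
        boolean recordHasSequence = false; // current record has sequence data
        String line;
        while ((line = reader.readLine()) != null) {
            lineNo++;
            if (line.isEmpty()) {
                continue; // assumption: blank lines are tolerated
            }
            if (line.charAt(0) == '>') {
                // A new header directly after a header means the previous record is empty;
                // the problem is reported at the line of the new header.
                if (seenHeader && !recordHasSequence) return lineNo;
                if (line.length() == 1) return lineNo; // header without an identifier
                seenHeader = true;
                recordHasSequence = false;
            } else {
                if (!seenHeader) return lineNo; // sequence data before any header
                for (int i = 0; i < line.length(); i++) {
                    if (!isSequenceChar(line.charAt(i))) return lineNo;
                }
                recordHasSequence = true;
            }
        }
        // Empty input, or a trailing header with no sequence, is also invalid.
        if (!seenHeader || !recordHasSequence) return Math.max(lineNo, 1);
        return -1;
    }

    /** Assumed alphabet: ASCII letters plus gap ('-') and stop ('*') symbols. */
    private static boolean isSequenceChar(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '-' || c == '*';
    }

    /** Non-interactive use: exit code 0 for a valid file, 1 otherwise. */
    public static void main(String[] args) throws IOException {
        try (Reader r = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.US_ASCII)) {
            long bad = firstInvalidLine(r);
            if (bad != -1) {
                System.err.println("Invalid FASTA at line " + bad);
                System.exit(1);
            }
        }
    }
}
```

In a pipeline, such a checker could be run as `java FastaSketch reads.fasta` (the file name is a placeholder) and the exit code used to decide whether downstream steps proceed.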