Software for pre-processing Illumina next-generation sequencing short read sequences.
Author information
Chen Chuming, Khaleel Sari S, Huang Hongzhan, Wu Cathy H
Affiliations
Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE USA.
Geisel School of Medicine, Dartmouth College, Hanover, NH USA.
Publication information
Source Code Biol Med. 2014 May 3;9:8. doi: 10.1186/1751-0473-9-8. eCollection 2014.
BACKGROUND
Compared with Sanger sequencing, next-generation sequencing (NGS) technologies are hindered by shorter read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of downstream analyses, e.g. de novo and reference-based assembly, by introducing artifacts and errors that may lead to incorrect interpretation of the data. Although many tools have been developed for quality control and pre-processing of NGS data, none provides flexible and comprehensive trimming options together with parallel processing to expedite pre-processing of large NGS datasets.
METHODS
We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with those of three existing tools: CutAdapt, NGS QC Toolkit, and Trimmomatic. We also compared the effects of pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly of three genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157:H7.
RESULTS
Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq short read sequences from eukaryotic and bacterial genomes, with a focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that, across three organisms and three sequencing platforms, trimming improved the mean quality scores of the trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance. In general, ngsShoRT outperformed comparable trimming tools in trimming speed and in the improvement of de novo and reference-based assembly, as measured by assembly contiguity and correctness.
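To make the "mean quality score" measure concrete, the following is a minimal Python sketch of how a per-read mean Phred score can be computed from a FASTQ quality string. It is an illustration only and is not taken from ngsShoRT (which is written in Perl). The function names and the default ASCII offset of 33 are assumptions; older Illumina pipelines encoded qualities with an offset of 64, so the offset is left as a parameter.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores.

    The offset is an assumption: 33 for Sanger / Illumina 1.8+ encoding,
    64 for older Illumina encodings.
    """
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality: str, offset: int = 33) -> float:
    """Return the arithmetic mean Phred score of one read (0.0 if empty)."""
    scores = phred_scores(quality, offset)
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # 'I' encodes Phred 40 and '#' encodes Phred 2 under the +33 offset.
    print(mean_quality("IIII"))
    print(mean_quality("II##"))
```

Comparing this value before and after trimming, averaged over all reads, gives one simple way to check whether a trimming step raised overall read quality.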
CONCLUSIONS
Trimming short read sequences can improve the quality of de novo and reference-based assembly as well as assembler performance. The parallel processing capability of ngsShoRT reduces trimming time and improves memory efficiency when dealing with large datasets. We recommend combining sequencing artifact removal with quality-score-based read filtering and base trimming as the most consistent method for improving sequence quality and downstream assemblies. The ngsShoRT source code, user guide, and tutorial are available at http://research.bioinformatics.udel.edu/genomics/ngsShoRT/. ngsShoRT can be incorporated as a pre-processing step in genome and transcriptome assembly projects.
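The recommended combination (artifact removal, base trimming, and quality-based read filtering) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ngsShoRT implementation: the adapter is matched exactly, low-quality bases are trimmed only from the 3' end, and every function name, threshold, and the +33 quality offset are hypothetical choices made for the example.

```python
from typing import Optional, Tuple

Read = Tuple[str, str]  # (sequence, quality string)


def strip_3prime_adapter(seq: str, qual: str, adapter: str) -> Read:
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_3prime_low_quality(seq: str, qual: str,
                            min_q: int = 20, offset: int = 33) -> Read:
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def preprocess_read(seq: str, qual: str, adapter: str,
                    min_q: int = 20, min_mean_q: float = 20.0,
                    min_len: int = 20, offset: int = 33) -> Optional[Read]:
    """Apply adapter removal, 3' base trimming, then read-level filtering.

    Returns the trimmed read, or None if it is too short or its mean
    Phred score falls below min_mean_q (all thresholds are illustrative).
    """
    seq, qual = strip_3prime_adapter(seq, qual, adapter)
    seq, qual = trim_3prime_low_quality(seq, qual, min_q, offset)
    if not seq or len(seq) < min_len:
        return None
    mean_q = sum(ord(ch) - offset for ch in qual) / len(qual)
    if mean_q < min_mean_q:
        return None
    return seq, qual


if __name__ == "__main__":
    adapter = "AGATCGGAAG"  # hypothetical adapter prefix for the example
    read = preprocess_read("ACGTACGTAC" + adapter,
                           "IIIIIIII##" + "IIIIIIIIII",
                           adapter, min_len=5)
    print(read)
```

Running the steps in this order means that bases belonging to an adapter never contribute to the quality statistics used for filtering; a production trimmer would also need to handle partial and mismatched adapter occurrences, and paired-end reads.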