Wang Ting-Hsuan, Huang Cheng-Ching, Hung Jui-Hung
Department of Computer Science, College of Computer Science, National Chiao Tung University, National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan.
Bioinformatics. 2021 Jul 27;37(13):1846-1852. doi: 10.1093/bioinformatics/btab025.
Cross-sample comparisons and large-scale meta-analyses based on next-generation sequencing (NGS) require replicable and universal data preprocessing, including the removal of adapter fragments from contaminated reads (i.e. adapter trimming). Modern adapter trimmers require users to provide candidate adapter sequences for each sample, but these sequences are sometimes unavailable or incorrectly documented in repositories such as GEO or SRA. Large-scale meta-analyses are therefore jeopardized by suboptimal adapter trimming.
Here we introduce a set of fast and accurate adapter detection and trimming algorithms that require no a priori adapter sequences. The algorithms are implemented in modern C++ and use SIMD and multithreading for speed. Our experiments and benchmarks show that the implementation (i.e. EARRINGS), without being given any hint of the adapter sequences, reaches accuracy comparable to, and throughput higher than, that of existing adapter trimmers. EARRINGS is particularly useful in meta-analyses of large batches of datasets and can be incorporated into sequence analysis pipelines at any scale.
EARRINGS is open-source software and is available at https://github.com/jhhung/EARRINGS.
Supplementary data are available at Bioinformatics online.