Department of Electronic Systems and Information Processing, Faculty of Electrical Engineering and Computing, University of Zagreb, 10000 Zagreb, Croatia.
Département d'Ecologie et d'Evolution, Université de Lausanne, Quartier Sorge, 1015 Lausanne, Switzerland.
Bioinformatics. 2018 Mar 1;34(5):748-754. doi: 10.1093/bioinformatics/btx668.
High-throughput sequencing has transformed the study of gene expression levels through RNA-seq, a technique that is now routinely used in fields such as genetic research and diagnostics. The advent of third-generation sequencing technologies, which provide significantly longer reads, opens up new possibilities. However, the high error rates common to these technologies pose new bioinformatics challenges for the gapped alignment of reads to their genomic origin. In this study, we explored how currently available splice-aware RNA-seq alignment tools cope with increased read lengths and error rates. All tested tools were initially developed for short NGS reads, but some claim to support long Pacific Biosciences (PacBio) or even Oxford Nanopore Technologies (ONT) MinION reads.
The tools were tested on synthetic and real datasets from two technologies (PacBio and ONT MinION). Alignment quality and resource usage were compared across the aligners. The effect of error-correcting long reads was explored, using both self-correction and correction with an external short-read dataset. A tool was developed for evaluating RNA-seq alignment results. It can be used to compare the alignment of simulated reads to their genomic origin, or to compare the alignment of real reads to a set of annotated transcripts. Our tests show that while some RNA-seq aligners were unable to cope with long error-prone reads, others produced good results overall. We further show that alignment accuracy can be improved by using error-corrected reads.
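As an illustration of what comparing a simulated read's alignment to its genomic origin can involve, the following is a minimal sketch, not the method implemented in RNAseqEval. It assumes a SAM-style CIGAR string in which `N` marks a skipped intron, 0-based half-open reference coordinates, and a known list of exon intervals for the read's true origin. The function names (`cigar_to_blocks`, `overlap_bases`, `evaluate`) and the precision/recall summary are hypothetical choices made for this example.

```python
import re
from typing import Dict, List, Tuple

# Half-open interval [start, end) in 0-based reference coordinates.
Interval = Tuple[int, int]

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_to_blocks(pos: int, cigar: str) -> List[Interval]:
    """Split a spliced alignment into reference blocks separated by N operations."""
    blocks: List[Interval] = []
    start = cur = pos
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "MD=X":      # these operations consume reference bases
            cur += n
        elif op == "N":       # skipped region (intron): close the current block
            if cur > start:
                blocks.append((start, cur))
            cur += n
            start = cur
        # I, S, H and P do not consume reference bases and are ignored here
    if cur > start:
        blocks.append((start, cur))
    return blocks


def overlap_bases(a: List[Interval], b: List[Interval]) -> int:
    """Count shared bases; assumes intervals within each list do not overlap."""
    return sum(
        max(0, min(e1, e2) - max(s1, s2))
        for s1, e1 in a
        for s2, e2 in b
    )


def evaluate(pos: int, cigar: str, true_exons: List[Interval]) -> Dict[str, float]:
    """Compare aligned blocks with the exons the simulated read came from."""
    blocks = cigar_to_blocks(pos, cigar)
    aligned = sum(e - s for s, e in blocks)
    truth = sum(e - s for s, e in true_exons)
    shared = overlap_bases(blocks, true_exons)
    return {
        "precision": shared / aligned if aligned else 0.0,  # aligned bases on true exons
        "recall": shared / truth if truth else 0.0,         # true exon bases recovered
    }


if __name__ == "__main__":
    # Hypothetical read: soft clip, exon block, 200 bp intron, exon block with an insertion.
    print(evaluate(100, "5S50M200N30M2I20M", [(90, 150), (350, 410)]))
```

A base-level overlap score of this kind is only one possible criterion. An evaluation could equally require that each true exon be hit, or that splice junctions match within some tolerance.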
https://github.com/kkrizanovic/RNAseqEval, https://figshare.com/projects/RNAseq_benchmark/24391.
Supplementary data are available at Bioinformatics online.