School of Information Science and Engineering, Central South University, Changsha 410083, China.
Genes (Basel). 2019 Jan 14;10(1):44. doi: 10.3390/genes10010044.
The advent of third-generation sequencing (TGS) technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, opens new possibilities for contig assembly, scaffolding, and high-performance computing in bioinformatics because of their long reads. However, the high error rate and poor quality of TGS reads pose new challenges for accurate genome assembly and long-read alignment. Efficient processing methods are needed to prioritize high-quality reads and thereby improve the results of error correction and assembly. In this study, we propose a novel Read Quality Evaluation and Selection Tool (REQUEST) for evaluating the quality of third-generation long reads. REQUEST generates training data consisting of high-quality and low-quality reads, which are characterized by their nucleotide combinations. A linear regression model is then built to score the quality of reads. The method was tested on three datasets from different species. The results showed that the top-scored reads prioritized by REQUEST achieved higher alignment accuracies. Contig assemblies based on the top-scored reads also outperformed conventional approaches that use all reads. REQUEST can distinguish high-quality reads from low-quality ones without using reference genomes, making it a promising alternative to alignment-based algorithms for sequence-quality evaluation.
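The general idea described in the abstract (characterize each read by its nucleotide combinations, fit a linear regression on reads labeled high or low quality, then keep the top-scored reads) can be illustrated with a minimal Python sketch. This is not the REQUEST implementation: the abstract does not state the k-mer length, the exact feature definition, or how the training reads are generated, so the choice of K = 2, the normalized k-mer frequencies, the gradient-descent fit, the hand-labeled toy reads, and all function names below are assumptions made purely for illustration.

```python
from itertools import product

K = 2  # assumed k-mer length; the abstract does not specify one
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_features(read):
    """Return normalized k-mer frequencies for one read, plus a bias term."""
    counts = [0.0] * len(KMERS)
    total = 0
    for i in range(len(read) - K + 1):
        j = INDEX.get(read[i:i + K].upper())
        if j is not None:  # skip k-mers containing N or other symbols
            counts[j] += 1.0
            total += 1
    if total:
        counts = [c / total for c in counts]
    return counts + [1.0]  # trailing 1.0 acts as the intercept


def fit_linear(X, y, lr=0.5, epochs=2000):
    """Fit least-squares weights by batch gradient descent (assumed solver)."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w


def score(read, w):
    """Predicted quality score of a read under the fitted linear model."""
    return sum(wj * xj for wj, xj in zip(w, kmer_features(read)))


def select_top(reads, w, fraction=0.5):
    """Keep the highest-scoring fraction of reads (at least one read)."""
    ranked = sorted(reads, key=lambda r: score(r, w), reverse=True)
    keep = max(1, int(len(reads) * fraction))
    return ranked[:keep]


if __name__ == "__main__":
    # Toy reads labeled by hand for illustration only; not real sequencing data.
    high = ["ACGTACGTACGTACGT", "TGCATGCATGCATGCA"]
    low = ["AAAAAAAACCCCCCCC", "GGGGGGGGTTTTTTTT"]
    X = [kmer_features(r) for r in high + low]
    y = [1.0] * len(high) + [0.0] * len(low)
    w = fit_linear(X, y)
    for read in select_top(high + low, w, fraction=0.5):
        print(read, round(score(read, w), 3))
```

Note that scoring a read here needs only the read itself and the fitted weights, which mirrors the reference-free property claimed in the abstract. In practice a numerical library would likely replace the hand-written solver, and the features and labels would come from the procedure described in the full paper.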