A Sequence-Based Novel Approach for Quality Evaluation of Third-Generation Sequencing Reads.

Affiliation

School of Information Science and Engineering, Central South University, Changsha 410083, China.

Publication information

Genes (Basel). 2019 Jan 14;10(1):44. doi: 10.3390/genes10010044.

DOI:10.3390/genes10010044

PMID:30646604

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6356754/

Abstract

The advent of third-generation sequencing (TGS) technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, provides new possibilities for contig assembly, scaffolding, and high-performance computing in bioinformatics because of their long reads. However, the high error rate and poor quality of TGS reads pose new challenges for accurate genome assembly and long-read alignment. Efficient processing methods are needed to prioritize high-quality reads and so improve the results of error correction and assembly. In this study, we propose a novel Read Quality Evaluation and Selection Tool (REQUEST) for evaluating the quality of third-generation long reads. REQUEST generates training data of high-quality and low-quality reads, which are characterized by their nucleotide combinations. A linear regression model is built to score the quality of reads. The method was tested on three datasets from different species. The results showed that the top-scored reads prioritized by REQUEST achieved higher alignment accuracies. The contig assembly results based on the top-scored reads also outperformed conventional approaches that use all reads. REQUEST is able to distinguish high-quality reads from low-quality ones without using reference genomes, making it a promising alternative to alignment-based algorithms for sequence-quality evaluation.
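The scoring idea in the abstract (reads characterized by nucleotide combinations, then scored by a linear regression model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the features are normalized dinucleotide (k = 2) frequencies, the training labels are 1 for high-quality and 0 for low-quality reads, a small ridge term is added to keep the normal equations solvable, and the names `kmer_features`, `fit_linear_scorer`, and `score_read` are hypothetical. The toy reads are made up and carry no biological meaning.

```python
from itertools import product


def kmer_features(read, k=2):
    """Normalized counts of every k-mer over the alphabet ACGT (assumed feature set)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])  # k-mers with other symbols (e.g. N) are skipped
        if j is not None:
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_scorer(reads, labels, k=2, ridge=1e-6):
    """Least-squares fit of label ~ k-mer features via (X^T X + ridge*I) w = X^T y."""
    X = [kmer_features(r, k) for r in reads]
    d = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(row[i] * y for row, y in zip(X, labels)) for i in range(d)]
    return solve(xtx, xty)


def score_read(read, weights, k=2):
    """Linear score of a read; higher means more similar to the high-quality class."""
    return sum(f * w for f, w in zip(kmer_features(read, k), weights))


# Toy data (made up): label 1 = "high quality", label 0 = "low quality".
toy_reads = ["ACGTACGTACGT", "TGCATGCATGCA", "AAAAAAAAAAAA", "AAAAAATTTTTT"]
toy_labels = [1.0, 1.0, 0.0, 0.0]
weights = fit_linear_scorer(toy_reads, toy_labels)

# Prioritize reads by descending score, mirroring the "top-scored reads" idea.
ranked = sorted(toy_reads, key=lambda r: score_read(r, weights), reverse=True)
```

In this sketch no reference genome is consulted when scoring; only the read sequence itself is used, which is the property the abstract emphasizes. How the actual tool builds its training set and which nucleotide combinations it uses are not stated in the abstract.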

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/314c/6356754/780029948b7c/genes-10-00044-g001.jpg

Similar articles

A Sequence-Based Novel Approach for Quality Evaluation of Third-Generation Sequencing Reads.

Genes (Basel). 2019 Jan 14;10(1):44. doi: 10.3390/genes10010044.

Evaluation of tools for long read RNA-seq splice-aware alignment.

Bioinformatics. 2018 Mar 1;34(5):748-754. doi: 10.1093/bioinformatics/btx668.

Improving the sensitivity of long read overlap detection using grouped short k-mer matches.

BMC Genomics. 2019 Apr 4;20(Suppl 2):190. doi: 10.1186/s12864-019-5475-x.

Evaluating long-read de novo assembly tools for eukaryotic genomes: insights and considerations.

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad100. Epub 2023 Nov 24.

Highly accurate long reads are crucial for realizing the potential of biodiversity genomics.

BMC Genomics. 2023 Mar 16;24(1):117. doi: 10.1186/s12864-023-09193-9.

LRCstats, a tool for evaluating long reads correction methods.

Bioinformatics. 2017 Nov 15;33(22):3652-3654. doi: 10.1093/bioinformatics/btx489.

Error analysis of the PacBio sequencing CCS reads.

Int J Biostat. 2023 May 8;19(2):439-453. doi: 10.1515/ijb-2021-0091. eCollection 2023 Nov 1.

Improved assembly of noisy long reads by k-mer validation.

Genome Res. 2016 Dec;26(12):1710-1720. doi: 10.1101/gr.209247.116. Epub 2016 Oct 7.

Genome assembly using Nanopore-guided long and error-free DNA reads.

BMC Genomics. 2015 Apr 20;16(1):327. doi: 10.1186/s12864-015-1519-z.

SLR: a scaffolding algorithm based on long reads and contig classification.

BMC Bioinformatics. 2019 Oct 30;20(1):539. doi: 10.1186/s12859-019-3114-9.

Cited by

Comparative Transcriptome Analysis Unveils Regulatory Factors Influencing Fatty Liver Development in Lion-Head Geese under High-Intake Feeding Compared to Normal Feeding.

Vet Sci. 2024 Aug 11;11(8):366. doi: 10.3390/vetsci11080366.

References

MEC: Misassembly Error Correction in contigs based on distribution of paired-end reads and statistics of GC-contents.

IEEE/ACM Trans Comput Biol Bioinform. 2018 Oct 18. doi: 10.1109/TCBB.2018.2876855.

SCOP: a novel scaffolding algorithm based on contig classification and optimization.

Bioinformatics. 2019 Apr 1;35(7):1142-1150. doi: 10.1093/bioinformatics/bty773.

Improving de novo Assembly Based on Read Classification.

IEEE/ACM Trans Comput Biol Bioinform. 2020 Jan-Feb;17(1):177-188. doi: 10.1109/TCBB.2018.2861380. Epub 2018 Jul 30.

A novel scaffolding algorithm based on contig error correction and path extension.

IEEE/ACM Trans Comput Biol Bioinform. 2018 Jul 20. doi: 10.1109/TCBB.2018.2858267.

GapReduce: a gap filling algorithm based on partitioned read sets.

IEEE/ACM Trans Comput Biol Bioinform. 2018 Jan 5. doi: 10.1109/TCBB.2018.2789909.

Minimap2: pairwise alignment for nucleotide sequences.

Bioinformatics. 2018 Sep 15;34(18):3094-3100. doi: 10.1093/bioinformatics/bty191.

Full Genome Sequence of the Western Reserve Strain of Vaccinia Virus Determined by Third-Generation Sequencing.

Genome Announc. 2018 Mar 15;6(11):e01570-17. doi: 10.1128/genomeA.01570-17.

Complete genomic and transcriptional landscape analysis using third-generation sequencing: a case study of Saccharomyces cerevisiae CEN.PK113-7D.

Nucleic Acids Res. 2018 Apr 20;46(7):e38. doi: 10.1093/nar/gky014.

Genome Sequencing and Assembly by Long Reads in Plants.

Genes (Basel). 2017 Dec 28;9(1):6. doi: 10.3390/genes9010006.

MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads.

Nat Methods. 2017 Nov;14(11):1072-1074. doi: 10.1038/nmeth.4432. Epub 2017 Sep 18.
