seqQscorer: automated quality control of next-generation sequencing data using machine learning.

Affiliation

Johannes Gutenberg-Universität Mainz, Biozentrum I, Hans-Dieter-Hüsch-Weg 15, 55128, Mainz, Germany.

Publication information

Genome Biol. 2021 Mar 5;22(1):75. doi: 10.1186/s13059-021-02294-2.

DOI:10.1186/s13059-021-02294-2

PMID:33673854

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7934511/

Abstract

Controlling quality of next-generation sequencing (NGS) data files is a necessary but complex task. To address this problem, we statistically characterize common NGS quality features and develop a novel quality control procedure involving tree-based and deep learning classification algorithms. Predictive models, validated on internal and external functional genomics datasets, are to some extent generalizable to data from unseen species. The derived statistical guidelines and predictive models represent a valuable resource for users of NGS data to better understand quality issues and perform automatic quality control. Our guidelines and software are available at https://github.com/salbrec/seqQscorer .
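To make the idea of tree-based classification on quality features concrete, the following is a minimal sketch and not the seqQscorer implementation. It fits a single-split decision tree (a decision stump) by minimizing Gini impurity. The two feature names (fraction of mapped reads, duplication rate), the toy values, and the "high"/"low" labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy data: (mapped_fraction, duplication_rate) per sequencing file.
ROWS = [
    (0.95, 0.10), (0.92, 0.15), (0.90, 0.30), (0.88, 0.12),
    (0.60, 0.20), (0.55, 0.45), (0.40, 0.25), (0.35, 0.50),
]
# Hypothetical quality labels for the files above.
LABELS = ["high", "high", "high", "high", "low", "low", "low", "low"]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels):
    """Return (feature_index, threshold, left_label, right_label) for the
    split with the lowest weighted Gini impurity."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(row[f] for row in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    return best[1:]


def predict(stump, row):
    """Classify one feature tuple with a fitted stump."""
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


stump = fit_stump(ROWS, LABELS)
print(stump)
print(predict(stump, (0.91, 0.40)))
```

On this toy data the stump splits on the first feature, because a threshold between 0.60 and 0.88 separates the two labels completely. Practical tree-based classifiers combine many deeper trees and are evaluated on held-out data.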

Similar articles

NGS-QC Generator: A Quality Control System for ChIP-Seq and Related Deep Sequencing-Generated Datasets.

Methods Mol Biol. 2016;1418:243-65. doi: 10.1007/978-1-4939-3578-9_13.

NEAT: a framework for building fully automated NGS pipelines and analyses.

BMC Bioinformatics. 2016 Feb 1;17:53. doi: 10.1186/s12859-016-0902-3.

Machine learning random forest for predicting oncosomatic variant NGS analysis.

Sci Rep. 2021 Nov 8;11(1):21820. doi: 10.1038/s41598-021-01253-y.

SNooPer: a machine learning-based method for somatic variant identification from low-pass next-generation sequencing.

BMC Genomics. 2016 Nov 14;17(1):912. doi: 10.1186/s12864-016-3281-2.

Novel bioinformatics quality control metric for next-generation sequencing experiments in the clinical context.

Nucleic Acids Res. 2019 Dec 2;47(21):e135. doi: 10.1093/nar/gkz775.

Using R and Bioconductor in Clinical Genomics and Transcriptomics.

J Mol Diagn. 2020 Jan;22(1):3-20. doi: 10.1016/j.jmoldx.2019.08.006. Epub 2019 Oct 9.

Rapid evaluation and quality control of next generation sequencing data with FaQCs.

BMC Bioinformatics. 2014 Nov 19;15(1):366. doi: 10.1186/s12859-014-0366-2.

Statistical guidelines for quality control of next-generation sequencing techniques.

Life Sci Alliance. 2021 Aug 30;4(11). doi: 10.26508/lsa.202101113. Print 2021 Nov.

nPhase: an accurate and contiguous phasing method for polyploids.

Genome Biol. 2021 Apr 29;22(1):126. doi: 10.1186/s13059-021-02342-x.

Cited by

Advancing genome-based precision medicine: a review on machine learning applications for rare genetic disorders.

Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf329.

Integration of Bulk RNA-seq Pipeline Metrics for Assessing Low-Quality Samples.

Res Sq. 2025 Jul 3:rs.3.rs-6976695. doi: 10.21203/rs.3.rs-6976695/v1.

Assessing and mitigating batch effects in large-scale omics studies.

Genome Biol. 2024 Oct 3;25(1):254. doi: 10.1186/s13059-024-03401-9.

Overlooked poor-quality patient samples in sequencing data impair reproducibility of published clinically relevant datasets.

Genome Biol. 2024 Aug 16;25(1):222. doi: 10.1186/s13059-024-03331-6.

Identification of key biomarkers and associated pathways of pancreatic cancer using integrated transcriptomic and gene network analysis.

Saudi J Biol Sci. 2023 Nov;30(11):103819. doi: 10.1016/j.sjbs.2023.103819. Epub 2023 Sep 26.

Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality.

BMC Bioinformatics. 2022 Jul 14;23(Suppl 6):279. doi: 10.1186/s12859-022-04775-y.

A quality control portal for sequencing data deposited at the European genome-phenome archive.

Brief Bioinform. 2022 May 13;23(3). doi: 10.1093/bib/bbac136.

Statistical guidelines for quality control of next-generation sequencing techniques.

Life Sci Alliance. 2021 Aug 30;4(11). doi: 10.26508/lsa.202101113. Print 2021 Nov.

References

RASflow: an RNA-Seq analysis workflow with Snakemake.

BMC Bioinformatics. 2020 Mar 18;21(1):110. doi: 10.1186/s12859-020-3433-x.

ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest.

PLoS Comput Biol. 2019 Dec 18;15(12):e1007556. doi: 10.1371/journal.pcbi.1007556. eCollection 2019 Dec.

To Trim or Not to Trim: Effects of Read Trimming on the De Novo Genome Assembly of a Widespread East Asian Passerine, the Rufous-Capped Babbler (Cyanoderma ruficeps Blyth).

Genes (Basel). 2019 Sep 23;10(10):737. doi: 10.3390/genes10100737.

Hepatic transcriptome signatures in patients with varying degrees of nonalcoholic fatty liver disease compared with healthy normal-weight individuals.

Am J Physiol Gastrointest Liver Physiol. 2019 Apr 1;316(4):G462-G472. doi: 10.1152/ajpgi.00358.2018. Epub 2019 Jan 17.

The Encyclopedia of DNA elements (ENCODE): data portal update.

Nucleic Acids Res. 2018 Jan 4;46(D1):D794-D801. doi: 10.1093/nar/gkx1081.

FQC Dashboard: integrates FastQC results into a web-based, interactive, and extensible FASTQ quality control tool.

Bioinformatics. 2017 Oct 1;33(19):3137-3139. doi: 10.1093/bioinformatics/btx373. Epub 2017 Jun 9.

Salmon provides fast and bias-aware quantification of transcript expression.

Nat Methods. 2017 Apr;14(4):417-419. doi: 10.1038/nmeth.4197. Epub 2017 Mar 6.

Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse.

Nucleic Acids Res. 2017 Jan 4;45(D1):D658-D662. doi: 10.1093/nar/gkw983. Epub 2016 Oct 26.

ChiLin: a comprehensive ChIP-seq and DNase-seq quality control and analysis pipeline.

BMC Bioinformatics. 2016 Oct 3;17(1):404. doi: 10.1186/s12859-016-1274-4.

MultiQC: summarize analysis results for multiple tools and samples in a single report.

Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16.
