Accurate estimation of short read mapping quality for next-generation genome sequencing.

Affiliation

Department of Electrical Engineering & Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA.

Publication information

Bioinformatics. 2012 Sep 15;28(18):i349-i355. doi: 10.1093/bioinformatics/bts408.

DOI: 10.1093/bioinformatics/bts408

PMID: 22962451

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3436835/

Abstract

MOTIVATION

Several software tools specialize in the alignment of short next-generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment; in principle, this score tells researchers the likelihood that the alignment is correct. However, the reported mapping quality often correlates weakly with actual accuracy, and the qualities of many mappings are underestimated, encouraging researchers to discard correct mappings. Further, these low-quality mappings tend to correlate with variations in the genome (both single nucleotide and structural), and such mappings are important in accurately identifying genomic variants.

APPROACH

We develop a machine learning tool, LoQuM (LOgistic regression tool for calibrating the Quality of short read mappings), to assign reliable mapping quality scores to mappings of Illumina reads returned by any alignment tool. LoQuM uses statistics on the read (base quality scores reported by the sequencer) and on the alignment (number of matches, mismatches and deletions; the mapping quality score returned by the alignment tool, if available; and the number of mappings) as features for classification, and uses simulated reads to learn a logistic regression model that relates these features to actual mapping quality.
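The idea of learning mapping correctness from read and alignment features can be illustrated with a minimal sketch. This is not LoQuM's implementation: the feature layout, the scaling constants, the toy training rows and the plain gradient-descent fit are all assumptions made for illustration, and the aligner-reported mapping quality feature is left out to keep the example short.

```python
import math

# Hypothetical feature layout (an assumption, not taken from LoQuM):
#   [mean base quality / 40, mismatch fraction, 1 / number of mappings]
TRAIN_ROWS = [
    [0.90, 0.00, 1.00], [0.85, 0.02, 1.00], [0.80, 0.04, 0.50], [0.90, 0.02, 0.50],
    [0.50, 0.10, 0.10], [0.60, 0.08, 0.20], [0.55, 0.12, 0.25], [0.70, 0.10, 0.10],
]
# 1 = simulated read was mapped to its true origin, 0 = mapped elsewhere.
# With simulated reads the true origin is known, so labels come for free.
TRAIN_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def prob_correct(weights, features):
    """Predicted probability that a mapping is correct (bias is weights[0])."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression weights by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = prob_correct(weights, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


weights = train_logistic(TRAIN_ROWS, TRAIN_LABELS)
p_clean = prob_correct(weights, [0.90, 0.00, 1.00])  # high quality, unique hit
p_noisy = prob_correct(weights, [0.50, 0.12, 0.10])  # low quality, ten hits
print(f"clean unique read: {p_clean:.3f}, noisy multi-mapped read: {p_noisy:.3f}")
```

In practice a library implementation of logistic regression and a much larger simulated training set would replace the hand-written fit; the point here is only that the model's output is a probability of correctness, which can then be reported as a mapping quality.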

RESULTS

We test the predictions of LoQuM on an independent dataset generated by the ART short read simulation software and observe that LoQuM can 'resurrect' many mappings that are assigned zero quality scores by the alignment tools and are therefore likely to be discarded by researchers. We also observe that the recalibration of mapping quality scores greatly enhances the precision of called single nucleotide polymorphisms.
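To make the notion of "resurrecting" a mapping concrete, the sketch below converts a predicted probability of correctness into a Phred-scaled mapping quality, using the convention of the SAM format (quality = -10 log10 of the probability that the mapping is wrong). The cap of 60, the keep threshold of 20 and the example probabilities are hypothetical choices for illustration; the abstract does not state the criteria used in the paper.

```python
import math


def phred_from_prob_correct(p_correct, cap=60):
    """Phred-scaled quality: -10 * log10(P(mapping is wrong)), rounded and capped."""
    p_wrong = 1.0 - p_correct
    if p_wrong <= 0.0:
        return cap
    return min(cap, int(round(-10.0 * math.log10(p_wrong))))


def resurrected(mappings, threshold=20):
    """Return mappings the aligner scored 0 whose recalibrated quality passes the threshold.

    `mappings` is a list of (aligner_mapq, predicted_prob_correct) pairs;
    both the pair layout and the threshold are assumptions for this sketch.
    """
    kept = []
    for aligner_mapq, p_correct in mappings:
        recalibrated = phred_from_prob_correct(p_correct)
        if aligner_mapq == 0 and recalibrated >= threshold:
            kept.append((aligner_mapq, recalibrated))
    return kept


# Made-up example values, not results from the paper.
example = [(0, 0.995), (0, 0.40), (37, 0.999), (0, 0.99), (12, 0.70)]
rescued = resurrected(example)
print(f"{len(rescued)} zero-quality mappings would be kept after recalibration")
```

A downstream variant caller that filters on mapping quality would then see these mappings instead of discarding them.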

AVAILABILITY

LoQuM is available as open source at http://compbio.case.edu/loqum/.

CONTACT

matthew.ruffalo@case.edu.

References

ART: a next-generation sequencing read simulator.

Bioinformatics. 2012 Feb 15;28(4):593-4. doi: 10.1093/bioinformatics/btr708. Epub 2011 Dec 23.

Comparative analysis of algorithms for next-generation sequencing read alignment.

Bioinformatics. 2011 Oct 15;27(20):2790-6. doi: 10.1093/bioinformatics/btr477. Epub 2011 Aug 19.

Sequencing breakthroughs for genomic ecology and evolutionary biology.

Mol Ecol Resour. 2008 Jan;8(1):3-17. doi: 10.1111/j.1471-8286.2007.02019.x.

Integrated analysis of gene expression, CpG island methylation, and gene copy number in breast cancer cells by deep sequencing.

PLoS One. 2011 Feb 25;6(2):e17490. doi: 10.1371/journal.pone.0017490.

Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA.

Genome Biol. 2010;11(10):R99. doi: 10.1186/gb-2010-11-10-r99. Epub 2010 Oct 8.

Advances in understanding cancer genomes through second-generation sequencing.

Nat Rev Genet. 2010 Oct;11(10):685-96. doi: 10.1038/nrg2841.

mrsFAST: a cache-oblivious algorithm for short-read mapping.

Nat Methods. 2010 Aug;7(8):576-7. doi: 10.1038/nmeth0810-576.

De novo assembly of human genomes with massively parallel short read sequencing.

Genome Res. 2010 Feb;20(2):265-72. doi: 10.1101/gr.097261.109. Epub 2009 Dec 17.

Personalized copy number and segmental duplication maps using next-generation sequencing.

Nat Genet. 2009 Oct;41(10):1061-7. doi: 10.1038/ng.437. Epub 2009 Aug 30.

The Sequence Alignment/Map format and SAMtools.

Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8.
