Accurate estimation of short read mapping quality for next-generation genome sequencing.

Affiliation

Department of Electrical Engineering & Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA.

Publication information

Bioinformatics. 2012 Sep 15;28(18):i349-i355. doi: 10.1093/bioinformatics/bts408.

Abstract

MOTIVATION

Several software tools specialize in the alignment of short next-generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment; in principle, this score tells researchers the likelihood that the alignment is correct. However, the reported mapping quality often correlates weakly with actual accuracy, and the qualities of many mappings are underestimated, encouraging researchers to discard correct mappings. Further, these low-quality mappings tend to correlate with variations in the genome (both single nucleotide and structural), and such mappings are important in accurately identifying genomic variants.
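To make "likelihood that the alignment is correct" concrete, the sketch below converts between a mapping quality score and a probability. It assumes the Phred-scaled convention used by the SAM format (quality = -10 log10 of the probability that the mapping is wrong); the abstract does not say which scale each aligner uses, and the function names and the cap of 60 are illustrative choices, not part of LoQuM.

```python
import math


def mapq_to_error_prob(mapq: float) -> float:
    """Phred-scaled mapping quality -> probability that the mapping is wrong."""
    return 10.0 ** (-mapq / 10.0)


def prob_correct_to_mapq(p_correct: float, cap: int = 60) -> int:
    """Probability that the mapping is correct -> Phred-scaled quality.

    The cap is a hypothetical upper bound that keeps the score finite
    when the error probability is zero.
    """
    p_wrong = 1.0 - p_correct
    if p_wrong <= 0.0:
        return cap
    return min(cap, int(round(-10.0 * math.log10(p_wrong))))


# A quality of 0 claims the mapping is certainly wrong, which is why
# such mappings are usually filtered out before variant calling.
zero_quality_error = mapq_to_error_prob(0)
q20_error = mapq_to_error_prob(20)
q_for_99_percent = prob_correct_to_mapq(0.99)
```

Under this convention, a mapping that is actually correct 99% of the time deserves a score near 20, so an aligner that reports 0 for it is underestimating the quality in the sense described above.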

APPROACH

We develop a machine learning tool, LoQuM (LOgistic regression tool for calibrating the Quality of short read mappings), to assign reliable mapping quality scores to mappings of Illumina reads returned by any alignment tool. LoQuM uses statistics on the read (base quality scores reported by the sequencer) and on the alignment (number of matches, mismatches and deletions; the mapping quality score returned by the alignment tool, if available; and the number of mappings) as features for classification. It uses simulated reads, for which the true mapping is known, to learn a logistic regression model that relates these features to actual mapping quality.
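The general idea of the approach can be sketched as a small logistic regression trained on labelled mappings. This is a minimal illustration and not LoQuM's implementation: the three scaled features, the toy data, the learning rate and the plain gradient-descent fit are all assumptions made for the example, and the abstract does not describe how LoQuM encodes its features or fits its model.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, x):
    """Estimated probability that a mapping with features x is correct."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, label in zip(X, y):
            err = predict(weights, bias, x) - label
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical simulated reads. Each row is
# (mean base quality / 40, mismatches / 10, 1 / number of mappings),
# and the label is 1 if the read was mapped to its true origin, else 0.
X = [
    (0.95, 0.0, 1.0),
    (0.90, 0.1, 1.0),
    (0.85, 0.1, 0.5),
    (0.90, 0.2, 1.0),
    (0.50, 0.5, 0.2),
    (0.60, 0.4, 0.25),
    (0.55, 0.6, 0.5),
    (0.40, 0.3, 0.1),
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_logistic(X, y)

# Score two unseen mappings: a clean unique hit and a noisy multi-hit.
p_good = predict(weights, bias, (0.92, 0.1, 1.0))
p_bad = predict(weights, bias, (0.45, 0.5, 0.2))
```

Because simulated reads come with a known origin, every training row has a trustworthy label, which is what allows the predicted probability to be compared against actual mapping accuracy.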

RESULTS

We test the predictions of LoQuM on an independent dataset generated by the ART short read simulation software and observe that LoQuM can 'resurrect' many mappings that are assigned zero quality scores by the alignment tools and are therefore likely to be discarded by researchers. We also observe that the recalibration of mapping quality scores greatly enhances the precision of called single nucleotide polymorphisms.

AVAILABILITY

LoQuM is available as open source at http://compbio.case.edu/loqum/.

CONTACT

matthew.ruffalo@case.edu.
