Department of Electrical Engineering & Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA.
Bioinformatics. 2012 Sep 15;28(18):i349-i355. doi: 10.1093/bioinformatics/bts408.
MOTIVATION: Several software tools specialize in aligning short next-generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment; in principle, this score tells researchers how likely the alignment is to be correct. However, the reported mapping quality often correlates only weakly with actual accuracy, and the qualities of many mappings are underestimated, which encourages researchers to discard correct mappings. Further, these low-quality mappings tend to correlate with variations in the genome (both single nucleotide and structural), and such mappings are important for accurately identifying genomic variants.
APPROACH: We develop a machine learning tool, LoQuM (LOgistic regression tool for calibrating the Quality of short read Mappings), to assign reliable mapping quality scores to mappings of Illumina reads returned by any alignment tool. LoQuM uses statistics on the read (base quality scores reported by the sequencer) and on the alignment (number of matches, mismatches and deletions; the mapping quality score returned by the alignment tool, if available; and the number of mappings) as features for classification. It uses simulated reads to learn a logistic regression model that relates these features to actual mapping quality.
RESULTS: We test the predictions of LoQuM on an independent dataset generated by the ART short read simulation software and observe that LoQuM can 'resurrect' many mappings that are assigned zero quality scores by the alignment tools and are therefore likely to be discarded by researchers. We also observe that recalibrating mapping quality scores greatly improves the precision of single nucleotide polymorphism calls.
AVAILABILITY: LoQuM is available as open source at http://compbio.case.edu/loqum/.
CONTACT: matthew.ruffalo@case.edu.
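The recalibration idea described under APPROACH can be illustrated with a minimal sketch. This is not LoQuM's implementation: the two features (scaled mismatch count and scaled number of mappings), the toy data generator that stands in for simulated reads with known true positions, the gradient-descent fitting routine, and the Phred-style conversion of the predicted probability into a score are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit logistic regression weights (bias first) by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi  # gradient of the log-loss with respect to the logit
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict_correct(w, x):
    """Predicted probability that a mapping with features x is correct."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def phred(p_correct, cap=60):
    """Phred-scaled quality, -10 * log10(P(wrong)), capped at `cap` (assumed scale)."""
    p_wrong = max(1.0 - p_correct, 10 ** (-cap / 10))
    return min(cap, round(-10 * math.log10(p_wrong)))

# Hypothetical training set: each simulated read has a known true position,
# so every mapping can be labelled correct (1) or incorrect (0).
random.seed(0)
X, y = [], []
for _ in range(400):
    mismatches = random.randint(0, 5)
    n_mappings = random.randint(1, 4)
    # Assumed ground truth: more mismatches or more candidate mappings
    # make a correct mapping less likely.
    p_true = sigmoid(3.0 - 1.0 * mismatches - 0.8 * (n_mappings - 1))
    X.append([mismatches / 5.0, (n_mappings - 1) / 3.0])
    y.append(1 if random.random() < p_true else 0)

weights = fit_logistic(X, y)

# Recalibrated scores for a clean unique mapping and an ambiguous noisy one.
clean_q = phred(predict_correct(weights, [0.0, 0.0]))
noisy_q = phred(predict_correct(weights, [1.0, 1.0]))
```

In this sketch the learned probability replaces whatever score the aligner reported, so a mapping the aligner scored as zero can still receive a nonzero recalibrated quality if its features resemble those of correct mappings in the training data. A fuller feature set along the lines the abstract lists would add base quality statistics, match and deletion counts, and the aligner's own score where available.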