Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data.

Authors

Cline Eliot, Wisittipanit Nuttachat, Boongoen Tossapon, Chukeatirote Ekachai, Struss Darush, Eungwanichayapant Anant

Affiliations

School of Science, Mae Fah Luang University, Amphur Muang, Chiang Rai, Thailand.

Department of Biotechnology, East West Seed Company, San Sai, Chiang Mai, Thailand.

Publication information

PeerJ. 2020 Dec 7;8:e10501. doi: 10.7717/peerj.10501. eCollection 2020.

DOI: 10.7717/peerj.10501
PMID: 33354434
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7727374/
Abstract

BACKGROUND

Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alignments stored in Sequence Alignment/Map (SAM) files. Each alignment has a mapping quality (MAPQ) score indicating the probability a read is incorrectly aligned. This study investigated the recalibration of probability estimates used to compute MAPQ scores for improving variant calling performance in single-sample, low-coverage settings.
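
For reference, the SAM specification defines MAPQ on a Phred scale. The relation below is that standard definition rather than a formula quoted from this abstract, and the symbol p_wrong is introduced here only for illustration.

```latex
\mathrm{MAPQ} = \mathrm{round}\left(-10 \log_{10} p_{\text{wrong}}\right),
\qquad
p_{\text{wrong}} = \Pr(\text{mapping position is wrong})
```

Under this definition, p_wrong = 0.01 corresponds to MAPQ 20 and p_wrong = 0.001 to MAPQ 30, so a revised probability estimate directly changes the score a variant caller sees.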

MATERIALS AND METHODS

Simulated tomato, hot pepper and rice genomes were implanted with known variants. From these, simulated paired-end reads were generated at low coverage and aligned to the original reference genomes. Features extracted from the SAM-formatted alignment files for tomato were used to train machine learning models to detect incorrectly aligned reads and output estimates of the probability of misalignment for each read in all three data sets. MAPQ scores were then re-computed from these estimates. Next, the SAM files were updated with new MAPQ scores. Finally, variant calling was performed on the original and recalibrated alignments and the results compared.
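
The re-computation and SAM update steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the cap of 60, and the example read are all assumptions. The only facts relied on are that MAPQ is Phred-scaled and stored in the fifth tab-separated column of a SAM alignment line.

```python
import math

MAX_MAPQ = 60  # assumed upper bound for illustration; not taken from the paper


def prob_to_mapq(p_misaligned: float, max_mapq: int = MAX_MAPQ) -> int:
    """Convert a misalignment probability into a Phred-scaled MAPQ score."""
    if not 0.0 <= p_misaligned <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_misaligned == 0.0:
        return max_mapq  # log10(0) is undefined, so use the cap
    mapq = round(-10.0 * math.log10(p_misaligned))
    return max(0, min(max_mapq, mapq))


def recalibrate_sam_line(line: str, p_misaligned: float) -> str:
    """Return a SAM line with its MAPQ column replaced by a recomputed score."""
    if line.startswith("@"):
        return line  # header lines carry no MAPQ field
    fields = line.rstrip("\n").split("\t")
    fields[4] = str(prob_to_mapq(p_misaligned))  # column 5 holds MAPQ
    return "\t".join(fields)


# Hypothetical read: the aligner reported MAPQ 12, the model estimates p = 0.01.
example = "read1\t99\tchr1\t100\t12\t4M\t=\t300\t204\tACGT\tIIII"
updated = recalibrate_sam_line(example, 0.01)
```

In practice the probabilities would come from the trained classifier, and a library such as pysam would typically handle the file I/O. Plain string handling is used here only to keep the sketch self-contained.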

RESULTS

Incorrectly aligned reads comprised only 0.16% of the reads in the training set. This severe class imbalance required special consideration for model training. The F1 score for detecting misaligned reads ranged from 0.76 to 0.82. The best performing model was used to compute new MAPQ scores. Single Nucleotide Polymorphism (SNP) detection was improved after mapping score recalibration. In rice, recall for called SNPs increased by 5.2%, while for tomato and pepper it increased by 3.1% and 1.5%, respectively. For all three data sets the precision of SNP calls ranged from 0.91 to 0.95 and was largely unchanged by mapping score recalibration.
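
To illustrate why a 0.16% positive class calls for F1 rather than plain accuracy, the sketch below compares a trivial classifier with a hypothetical detector. Only the 0.16% rate comes from the abstract. The read total and all confusion-matrix counts are invented for the example.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


total_reads = 100_000  # invented total for illustration
misaligned = 160       # 0.16% of the reads, matching the reported rate

# Trivial classifier: labels every read as correctly aligned.
tn_trivial = total_reads - misaligned
accuracy_trivial = tn_trivial / total_reads
_, _, f1_trivial = precision_recall_f1(tp=0, fp=0, fn=misaligned)

# Hypothetical detector: finds 128 of the 160 misaligned reads, with 32 false alarms.
precision, recall, f1 = precision_recall_f1(tp=128, fp=32, fn=32)

print(f"trivial:  accuracy={accuracy_trivial:.4f}, F1={f1_trivial:.2f}")
print(f"detector: precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```

The trivial classifier reaches very high accuracy while detecting nothing, which is why an imbalance this severe is usually handled with measures such as class weighting or resampling and scored with F1.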

CONCLUSION

Recalibrating MAPQ scores delivers modest improvements in single-sample variant calling results. Some variant callers operate on multiple samples simultaneously. They exploit every sample's reads to compensate for the low read-depth of individual samples. This improves polymorphism detection and genotype inference. It may be that small improvements in single-sample settings translate to larger gains in a multi-sample experiment. A study to investigate this is ongoing.
