Cline Eliot, Wisittipanit Nuttachat, Boongoen Tossapon, Chukeatirote Ekachai, Struss Darush, Eungwanichayapant Anant
School of Science, Mae Fah Luang University, Amphur Muang, Chiang Rai, Thailand.
Department of Biotechnology, East West Seed Company, San Sai, Chiang Mai, Thailand.
PeerJ. 2020 Dec 7;8:e10501. doi: 10.7717/peerj.10501. eCollection 2020.
Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from true variation. Prior to variant calling, sequencer reads are aligned to a reference genome, and the alignments are stored in Sequence Alignment/Map (SAM) files. Each alignment carries a mapping quality (MAPQ) score, a Phred-scaled encoding of the probability that the read is incorrectly aligned. This study investigated whether recalibrating the probability estimates used to compute MAPQ scores improves variant calling performance in single-sample, low-coverage settings.
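As a brief illustration of how a MAPQ score relates to a misalignment probability, the sketch below decodes a score using the definition in the SAM specification (MAPQ = -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer, with 255 meaning unavailable). The function name `mapq_to_error_prob` is a hypothetical helper, not something taken from the study.

```python
def mapq_to_error_prob(mapq: int) -> float:
    """Decode a Phred-scaled MAPQ into the probability the alignment is wrong.

    Hypothetical helper for illustration; follows the SAM specification's
    definition MAPQ = -10 * log10(P(mapping position is wrong)).
    """
    if mapq == 255:
        # The SAM specification reserves 255 for "mapping quality not available".
        raise ValueError("MAPQ 255 means the mapping quality is unavailable")
    if mapq < 0:
        raise ValueError("MAPQ must be non-negative")
    return 10 ** (-mapq / 10)


if __name__ == "__main__":
    # Higher MAPQ means a smaller chance that the read is misplaced.
    for q in (0, 10, 20, 30, 60):
        print(q, mapq_to_error_prob(q))
```

Under this encoding, a score of 20 corresponds to roughly a 1 in 100 chance of misalignment, and a score of 30 to roughly 1 in 1000.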
Simulated tomato, hot pepper and rice genomes were implanted with known variants. From these, simulated paired-end reads were generated at low coverage and aligned to the original reference genomes. Features extracted from the SAM-formatted alignment files for tomato were used to train machine learning models to detect incorrectly aligned reads and to output an estimate of the probability of misalignment for each read in all three data sets. MAPQ scores were then re-computed from these estimates, and the SAM files were updated with the new scores. Finally, variant calling was performed on the original and recalibrated alignments and the results were compared.
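The step of re-computing MAPQ from a predicted misalignment probability and writing it back to a SAM record could look roughly like the following. This is a minimal sketch, not the authors' implementation: the helper names `prob_to_mapq` and `set_mapq` and the cap of 60 are assumptions made for illustration, and the only facts relied on are that MAPQ is Phred-scaled and occupies the fifth tab-separated column of a SAM alignment line.

```python
import math

MAX_MAPQ = 60  # assumed cap for illustration; aligners differ in their maximum


def prob_to_mapq(p_wrong: float, max_mapq: int = MAX_MAPQ) -> int:
    """Convert a misalignment probability to a Phred-scaled integer MAPQ."""
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_wrong == 0.0:
        return max_mapq  # log10(0) is undefined, so clamp to the cap
    return min(max_mapq, int(round(-10 * math.log10(p_wrong))))


def set_mapq(sam_line: str, p_wrong: float) -> str:
    """Return a SAM line with its MAPQ field replaced (hypothetical helper).

    Header lines (starting with '@') are returned unchanged. For alignment
    lines, any trailing newline is stripped from the returned string.
    """
    if sam_line.startswith("@"):
        return sam_line
    fields = sam_line.rstrip("\n").split("\t")
    fields[4] = str(prob_to_mapq(p_wrong))  # column 5 (index 4) is MAPQ
    return "\t".join(fields)


if __name__ == "__main__":
    # Made-up alignment record with the 11 mandatory SAM fields.
    record = "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII"
    print(set_mapq(record, 0.01))
```

In practice the probability passed to `set_mapq` would come from the trained model's output for that read; here it is supplied by hand.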
Incorrectly aligned reads comprised only 0.16% of the reads in the training set. This severe class imbalance required special consideration during model training. The F1 score for detecting misaligned reads ranged from 0.76 to 0.82. The best-performing model was used to compute new MAPQ scores. Single nucleotide polymorphism (SNP) detection improved after mapping score recalibration. In rice, recall for called SNPs increased by 5.2%, while for tomato and pepper it increased by 3.1% and 1.5%, respectively. For all three data sets, the precision of SNP calls ranged from 0.91 to 0.95 and was largely unchanged by mapping score recalibration.
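To see why a 0.16% positive class calls for a metric such as F1 rather than accuracy, consider the sketch below. The read counts are invented for illustration (only the 0.16% proportion comes from the text), and `precision_recall_f1` is a hypothetical helper built on the standard definitions of precision, recall and F1.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall and F1 from confusion-matrix counts.

    Each value is defined as 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    total_reads = 100_000  # invented total for illustration
    misaligned = 160       # 0.16% of the reads, as in the training set

    # A trivial model that labels every read "correctly aligned" is almost
    # always right, yet it never detects a single misaligned read.
    accuracy = (total_reads - misaligned) / total_reads
    print("baseline accuracy:", accuracy)
    print("baseline P/R/F1:", precision_recall_f1(tp=0, fp=0, fn=misaligned))

    # Invented counts for a detector that finds 128 of the 160 misaligned
    # reads while raising 32 false alarms.
    print("detector P/R/F1:", precision_recall_f1(tp=128, fp=32, fn=32))
```

The trivial baseline reaches very high accuracy with an F1 of zero, which is why the misalignment detector is better judged by F1.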
Recalibrating MAPQ scores delivers modest improvements in single-sample variant calling results. Some variant callers operate on multiple samples simultaneously. They exploit every sample's reads to compensate for the low read-depth of individual samples. This improves polymorphism detection and genotype inference. It may be that small improvements in single-sample settings translate to larger gains in a multi-sample experiment. A study to investigate this is ongoing.