Niaré Karamoko, Greenhouse Bryan, Bailey Jeffrey A
Brown University.
University of California San Francisco.
Res Sq. 2023 Feb 14:rs.3.rs-2561857. doi: 10.21203/rs.3.rs-2561857/v1.
Accurate variant calls from whole genome sequencing (WGS) of infections are crucial in malaria population genomics. Here we optimized a Plasmodium falciparum variant calling pipeline based on GATK version 4 (GATK4) and applied it to 6,626 public Illumina WGS samples. We optimized the parameters that control heterozygosity, local assembly region size, ploidy, mapping quality, and base quality in both GATK HaplotypeCaller and GenotypeGVCFs, leveraging control WGS data and accurate PacBio assemblies of 10 laboratory strains. From these controls we generated a high-quality training dataset to recalibrate the raw variant data. On current high-quality samples (read length = 250 bp, insert size = 405-524 bp), we show improved sensitivity (86.6 ± 1.7% for SNPs and 82.2 ± 5.9% for indels) compared to the default GATK4 pipeline (77.7 ± 1.3% for SNPs and 73.1 ± 5.1% for indels, adjusted P < 0.001) and to previous variant calling with GATK version 3 (GATK3; 70.3 ± 3.0% for SNPs and 59.7 ± 5.8% for indels, adjusted P < 0.001). The sensitivity of our pipeline on simulated mixed infection samples (80.8 ± 6.1% for SNPs and 78.3 ± 5.1% for indels) was likewise improved relative to default GATK4 (68.8 ± 6.0% for SNPs and 38.9 ± 0.7% for indels, adjusted P < 0.001). Precision was high and comparable across all pipelines on each type of data tested. We further show that combining high-quality SNPs and indels increases the resolution of local population structure detection in sub-Saharan Africa. Finally, we demonstrate that increasing ploidy improves the detection of drug resistance mutations and the estimation of complexity of infection. Overall, we provide an optimized GATK4 pipeline and resource for variant calling, which should help improve genomic studies of malaria.
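The sensitivity and precision figures quoted in the abstract compare a pipeline's calls against a trusted truth set. The abstract does not describe how the comparison was implemented, so the following is only a minimal sketch under stated assumptions: variants are matched by exact (chromosome, position, ref, alt) identity, and the function name `evaluate_calls`, the `Metrics` type, and all toy variants are hypothetical and not taken from the study.

```python
from typing import NamedTuple, Set, Tuple

# A variant is keyed by (chromosome, position, reference allele, alternate allele).
# Exact-match comparison is an assumption made for illustration only.
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    sensitivity: float
    precision: float


def evaluate_calls(truth: Set[Variant], calls: Set[Variant]) -> Metrics:
    """Compare a call set with a truth set and return sensitivity and precision."""
    true_pos = len(truth & calls)    # called and present in the truth set
    false_neg = len(truth - calls)   # present in the truth set but missed
    false_pos = len(calls - truth)   # called but absent from the truth set

    # Sensitivity = TP / (TP + FN); precision = TP / (TP + FP).
    sensitivity = true_pos / (true_pos + false_neg) if truth else float("nan")
    precision = true_pos / (true_pos + false_pos) if calls else float("nan")
    return Metrics(sensitivity, precision)


# Toy data (entirely made up): three of four truth variants are recovered,
# and one call is a false positive.
truth = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 40, "AT", "A"),
    ("chr2", 900, "C", "G"),
}
calls = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 40, "AT", "A"),
    ("chr3", 7, "T", "A"),
}

metrics = evaluate_calls(truth, calls)
print(f"sensitivity={metrics.sensitivity:.2f} precision={metrics.precision:.2f}")
```

In practice, indels often need allele normalization before matching, since the same event can be represented at different positions; exact-tuple matching as used here would undercount such cases.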