Skolkovo Institute of Science and Technology, Skolkovo, Russia.
Institute for Information Transmission Problems (Kharkevich Institute), Russian Academy of Sciences, Moscow, Russia.
Eur J Hum Genet. 2020 Nov;28(11):1615-1623. doi: 10.1038/s41431-020-0697-6. Epub 2020 Jul 29.
High-throughput sequencing of fetal DNA is a promising and increasingly common method for the discovery of all (or all coding) genetic variants in the fetus, either as part of prenatal screening or diagnosis, or for genetic diagnosis of spontaneous abortions. In many cases, the fetal DNA (from chorionic villi, amniotic fluid, or abortive tissue) can be contaminated with maternal cells, resulting in a mixture of fetal and maternal DNA. This maternal cell contamination (MCC) undermines the assumption, made by traditional variant callers, that each allele at a heterozygous site is covered, on average, by 50% of the reads, and can therefore lead to erroneous genotype calls. We present a panel of methods for reducing the genotyping error in the presence of MCC. All methods start with the output of GATK HaplotypeCaller on the sequencing data for the (contaminated) fetal sample and both of its parents, and additionally rely on information about the MCC fraction (which itself is readily estimated from the high-throughput sequencing data). The first of these methods uses a Bayesian probabilistic model to correct the fetal genotype calls produced by the MCC-unaware HaplotypeCaller. The other two methods "learn" the genotype-correction model from examples. We use simulated contaminated fetal data to train and test the models. Using the test sets, we show that all three methods lead to substantially improved accuracy when compared with the original MCC-unaware HaplotypeCaller calls. We then apply the best-performing method to three chorionic villus samples from spontaneously terminated pregnancies.
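To make the effect of MCC on allele balance concrete, the following is a minimal sketch of a Bayesian genotype correction at a single biallelic site. It is not the authors' model: it assumes binomial read sampling, known parental genotypes, a Mendelian prior on the fetal genotype, and a fixed sequencing error rate. All function names and the `error` parameter are hypothetical and chosen for illustration only.

```python
from math import comb

# Genotypes are encoded as the number of alternate alleles: 0, 1 or 2.
GENOTYPES = (0, 1, 2)


def expected_alt_fraction(fetal_gt, maternal_gt, mcc):
    """Expected fraction of alt-supporting reads in a fetal sample
    in which a fraction `mcc` of the DNA is maternal."""
    return (1.0 - mcc) * fetal_gt / 2.0 + mcc * maternal_gt / 2.0


def binomial_likelihood(alt_reads, depth, p, error=0.01):
    """Binomial probability of seeing `alt_reads` alt reads out of `depth`.
    `p` is clamped to [error, 1 - error] (an assumed error rate) so that
    a single miscalled base never yields a zero likelihood."""
    p = min(max(p, error), 1.0 - error)
    return comb(depth, alt_reads) * p ** alt_reads * (1.0 - p) ** (depth - alt_reads)


def mendelian_prior(maternal_gt, paternal_gt):
    """Prior over fetal genotypes given both parental genotypes."""
    pm = maternal_gt / 2.0  # probability the mother transmits the alt allele
    pp = paternal_gt / 2.0  # probability the father transmits the alt allele
    return {
        0: (1.0 - pm) * (1.0 - pp),
        1: pm * (1.0 - pp) + (1.0 - pm) * pp,
        2: pm * pp,
    }


def fetal_posterior(alt_reads, depth, maternal_gt, paternal_gt, mcc):
    """Posterior over the fetal genotype at one site, given read counts
    from the contaminated sample, parental genotypes and the MCC fraction."""
    prior = mendelian_prior(maternal_gt, paternal_gt)
    unnormalized = {
        g: prior[g]
        * binomial_likelihood(
            alt_reads, depth, expected_alt_fraction(g, maternal_gt, mcc)
        )
        for g in GENOTYPES
    }
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}


if __name__ == "__main__":
    # Mother heterozygous, father homozygous reference, 40% contamination.
    # 20 alt reads out of 100 could look heterozygous to an MCC-unaware
    # caller, but matches a homozygous-reference fetus plus maternal reads.
    print(fetal_posterior(alt_reads=20, depth=100,
                          maternal_gt=1, paternal_gt=0, mcc=0.4))
```

In this sketch a heterozygous fetus with a homozygous-reference mother is expected to show an alt-read fraction of (1 - MCC)/2 rather than 50%, which is the shift that an MCC-unaware caller does not account for. A practical implementation would more likely work from the genotype likelihoods reported by the caller than from raw read counts.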