Department of Biology, University of Fribourg, Fribourg, Switzerland.
Swiss Institute of Bioinformatics, Fribourg, Switzerland.
Mol Ecol Resour. 2020 Jul;20(4):856-870. doi: 10.1111/1755-0998.13153. Epub 2020 Apr 6.
In non-model organisms, evolutionary questions are frequently addressed using reduced representation sequencing techniques because they are inexpensive, easy to use, and do not require genomic resources such as a reference genome. However, evidence is accumulating that such techniques may be affected by specific biases, which calls into question the accuracy of the resulting genotypes and, as a consequence, their usefulness in evolutionary studies. Here, we introduce three strategies to estimate genotyping error rates from such data: by comparison to high-quality genotypes obtained with a different technique, from individual replicates, or from a population sample under the assumption of Hardy-Weinberg equilibrium. Applying these strategies to data obtained with Restriction site Associated DNA sequencing (RAD-seq), arguably the most popular reduced representation sequencing technique, revealed per-allele genotyping error rates that were much higher than sequencing error rates, particularly at heterozygous sites that were wrongly inferred as homozygous. As we exemplify through the inference of genome-wide and local ancestry of well-characterized hybrids of two Eurasian poplar (Populus) species, such high error rates may lead to wrong biological conclusions. By properly accounting for these error rates in downstream analyses, either by incorporating genotyping errors directly or by recalibrating genotype likelihoods, we were nevertheless able to use the RAD-seq data to support biologically meaningful and robust inferences of ancestry among Populus hybrids. Based on these findings, we strongly recommend carefully assessing genotyping error rates in reduced representation sequencing experiments and properly accounting for them in downstream analyses, for instance using the tools presented here.
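To make the first strategy concrete, the following is a minimal sketch of how discordance against a set of high-quality reference genotypes could be tallied. It is not the estimation method or the tools referred to in the abstract, which are not described here; it is a naive counting approach. The function name `discordance_rates`, the coding of genotypes as alternate-allele counts (0, 1, 2), and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def discordance_rates(truth, called):
    """Naive comparison of called genotypes against reference genotypes.

    Both inputs are equal-length sequences of genotypes coded as
    alternate-allele counts (0 = hom-ref, 1 = het, 2 = hom-alt);
    None marks a missing call and that site is skipped.
    (Hypothetical helper, not part of any published tool.)

    Returns (by_class, per_allele):
      by_class   - fraction of miscalled genotypes per true genotype class
      per_allele - mismatching alleles divided by total alleles compared
    """
    totals = Counter()      # number of compared sites per true genotype
    miscalled = Counter()   # number of discordant calls per true genotype
    allele_diffs = 0        # sum of |truth - called| over all sites
    n_sites = 0

    for t, c in zip(truth, called):
        if t is None or c is None:
            continue
        totals[t] += 1
        n_sites += 1
        if t != c:
            miscalled[t] += 1
            allele_diffs += abs(t - c)

    by_class = {g: miscalled[g] / totals[g] for g in sorted(totals)}
    per_allele = allele_diffs / (2 * n_sites) if n_sites else float("nan")
    return by_class, per_allele


# Toy data (invented): two of the four true heterozygotes are called homozygous.
truth = [0, 1, 1, 2, 1, 0, 2, 1]
called = [0, 0, 1, 2, 2, 0, 2, 1]
by_class, per_allele = discordance_rates(truth, called)
print(by_class)    # {0: 0.0, 1: 0.5, 2: 0.0}
print(per_allele)  # 0.125
```

Breaking the discordance down by true genotype class is what exposes the pattern highlighted above, namely that errors concentrate at heterozygous sites called as homozygous. A single pooled rate would hide this asymmetry.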