Genetics Dep., Luiz de Queiroz College of Agriculture, Univ. of São Paulo, Av. Pádua Dias, 11, C. P. 9, 13.418-900, Piracicaba, São Paulo, Brazil.
Embrapa Beef Cattle, Av. Rádio Maia, 830, Zona Rural, 79.106-550, Campo Grande, Mato Grosso do Sul, Brazil.
Plant Genome. 2019 Nov;12(3):1-9. doi: 10.3835/plantgenome2019.01.0002.
Introduced the concept of expected genotype quality (EGQ) and software to calculate it. Provided read depth guidelines for GBS in tetraploids. Developed software to generate diploidized genotype calls from VCF files. Demonstrated the value of aligning GBS reads to a mock reference genome for SNP discovery. Recommended filtering based on GQ and read depth to prevent genotype bias.

Although genotyping-by-sequencing (GBS) is a well-established marker technology in diploids, the development of best practices for tetraploid species is a topic of current research. We determined the theoretical relationship between read depth and the phred-scaled probability of genotype misclassification conditioned on the true genotype, which we call expected genotype quality (EGQ). If the GBS method has 0.5% allelic error, then 17 reads are needed to classify simplex tetraploids as heterozygous with 95% accuracy (EGQ = 13) vs. 61 reads to determine allele dosage. We developed an R script to convert tetraploid GBS data in variant call format (VCF) into diploidized genotype calls and applied it to 267 interspecific hybrids of the tetraploid forage grass Urochloa. When reads were aligned to a mock reference genome created from GBS data of the Urochloa brizantha (Hochst. ex A. Rich.) R. D. Webster cultivar Marandu, 25,678 biallelic single nucleotide polymorphisms (SNPs) were discovered, compared with ∼3000 SNPs when aligning to the closest true reference genomes, Setaria viridis (L.) P. Beauv. and S. italica (L.) P. Beauv. Cross-validation revealed that missing genotypes were imputed by the random forest method with a median accuracy of 0.85 regardless of heterozygote frequency. Using the Urochloa spp. hybrids, we illustrated how filtering samples based only on genotype quality (GQ) creates genotype bias; a depth threshold based on EGQ is also needed regardless of whether genotypes are called using a diploidized or allele dosage model.
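
The relationship between read depth and EGQ described above can be illustrated with a minimal Python sketch. It is not the authors' software: it assumes a symmetric per-read allelic error, binomial sampling of reads, and a maximum-likelihood genotype call with equal priors across dosage classes, and all function names are hypothetical. Under those assumptions, a simplex tetraploid at depth 17 with 0.5% allelic error gives an EGQ of roughly 13 for the diploidized call. The simplified rule is not guaranteed to reproduce the published depth thresholds, in particular the dosage figure, because the paper's exact model may differ.

```python
from math import comb, log10


def read_alt_prob(dosage, ploidy=4, err=0.005):
    """Probability that one read shows the alternate allele.

    Assumes a symmetric allelic error: each read reports the wrong
    allele with probability `err` (an assumption of this sketch).
    """
    frac = dosage / ploidy
    return frac * (1 - err) + (1 - frac) * err


def binom_pmf(k, n, p):
    """Binomial probability of k alternate reads out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def diploid_class(dosage, ploidy=4):
    """Collapse an allele dosage into a diploidized genotype class."""
    if dosage == 0:
        return "hom_ref"
    if dosage == ploidy:
        return "hom_alt"
    return "het"


def misclassification_prob(depth, true_dosage, ploidy=4, err=0.005,
                           diploidized=True):
    """P(called genotype != true genotype | true genotype, depth).

    The call is the maximum-likelihood dosage with equal priors
    (assumed here); with diploidized=True only the hom/het class
    has to match, otherwise the exact dosage has to match.
    """
    probs = [read_alt_prob(d, ploidy, err) for d in range(ploidy + 1)]
    if diploidized:
        label = lambda d: diploid_class(d, ploidy)
    else:
        label = lambda d: d
    wrong = 0.0
    for k in range(depth + 1):
        liks = [binom_pmf(k, depth, p) for p in probs]
        called = max(range(ploidy + 1), key=lambda d: liks[d])
        if label(called) != label(true_dosage):
            wrong += liks[true_dosage]
    return wrong


def expected_genotype_quality(depth, true_dosage, **kwargs):
    """Phred-scaled misclassification probability (EGQ in this sketch)."""
    p = misclassification_prob(depth, true_dosage, **kwargs)
    return float("inf") if p == 0 else -10 * log10(p)


if __name__ == "__main__":
    for depth in (10, 17, 30, 61):
        dip = expected_genotype_quality(depth, 1, diploidized=True)
        dos = expected_genotype_quality(depth, 1, diploidized=False)
        print(f"depth={depth:3d}  EGQ diploidized={dip:5.1f}  EGQ dosage={dos:5.1f}")
```

Because a wrong diploidized class always implies a wrong dosage, the dosage EGQ in this sketch can never exceed the diploidized EGQ at the same depth, which is consistent with the abstract's point that dosage calling needs deeper coverage.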