Department of Animal Behavior, University of Bielefeld, Postfach 100131, 33615, Bielefeld, Germany.
British Antarctic Survey, High Cross, Madingley Road, Cambridge, CB3 0ET, UK.
BMC Genomics. 2019 Jan 22;20(1):72. doi: 10.1186/s12864-019-5440-8.
Restriction site-associated DNA sequencing (RADseq) has revolutionized the study of wild organisms by allowing cost-effective genotyping of thousands of loci. However, for species lacking reference genomes, it can be challenging to select the restriction enzyme that offers the best balance between the number of RAD loci obtained and depth of coverage, which is crucial for a successful outcome. To address this issue, PredRAD was recently developed. It uses probabilistic models to predict restriction site frequencies from a transcriptome assembly or other sequence resource, based on either GC content or mono-, di- or trinucleotide composition. The program generates predictions that are broadly consistent with estimates of the true number of restriction sites obtained through in silico digestion of available reference genome assemblies. In practice, however, the number of loci actually obtained may differ: incomplete enzymatic digestion or patchy sequence coverage across the genome could leave some loci unrepresented in a RAD dataset, while erroneous assembly could inflate the number of loci. To investigate this, we used genome and transcriptome assemblies together with RADseq data from the Antarctic fur seal (Arctocephalus gazella) to compare PredRAD predictions with empirical estimates of the number of loci obtained via in silico digestion and from de novo assemblies.
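The contrast between a model-based prediction and an in silico digestion can be illustrated with a minimal sketch. This is not PredRAD's code: the function names are hypothetical, only the simplest (GC content) model is shown, bases are assumed to be independent, and the EcoRI recognition site GAATTC is used purely as an example enzyme.

```python
from collections import Counter


def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ACGT")
    return (counts["G"] + counts["C"]) / total if total else 0.0


def site_probability(site, gc):
    """Probability that a given position starts `site` under a GC-content model.

    Assumption: bases are independent, G and C each occur with probability
    gc / 2, and A and T each occur with probability (1 - gc) / 2.
    """
    probability = 1.0
    for base in site.upper():
        probability *= gc / 2 if base in "GC" else (1 - gc) / 2
    return probability


def expected_sites(site, gc, genome_length):
    """Expected number of recognition sites in a genome of the given length."""
    positions = max(genome_length - len(site) + 1, 0)
    return site_probability(site, gc) * positions


def count_sites(sequence, site):
    """In silico digestion reduced to its core: count occurrences of `site`.

    Overlapping matches are counted. For a palindromic site such as GAATTC,
    scanning one strand is sufficient because both strands read the same.
    """
    sequence, site = sequence.upper(), site.upper()
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count


if __name__ == "__main__":
    toy_sequence = "AAGAATTCAAGAATTCGGCC"  # toy sequence, not real data
    gc = gc_content(toy_sequence)
    print("GC content:", gc)
    print("Predicted sites:", expected_sites("GAATTC", gc, len(toy_sequence)))
    print("Observed sites:", count_sites(toy_sequence, "GAATTC"))
```

In this sketch the prediction depends only on base composition and sequence length, whereas the count requires the sequence itself, which is why a composition-based model is attractive when only a transcriptome or a partial sequence resource is available.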
PredRAD yielded consistently higher predicted numbers of restriction sites for the transcriptome assembly relative to the genome assembly. The trinucleotide and dinucleotide models also predicted higher frequencies than the mononucleotide or GC content models. Overall, the dinucleotide model applied to the transcriptome assembly and the trinucleotide model applied to the genome assembly generated the predictions that were closest to the number of restriction sites estimated by in silico digestion. Furthermore, the number of de novo assembled RAD loci mapping to restriction sites was similar to the expectation based on in silico digestion.
Our study reveals generally high concordance between PredRAD predictions and empirical estimates of the number of RAD loci. This further supports the utility of PredRAD, while also suggesting that it may be feasible to sequence and assemble the majority of RAD loci present in an organism's genome.