Warr Amanda, Robert Christelle, Hume David, Archibald Alan L, Deeb Nader, Watson Mick
Division of Genetics and Genomics, The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh, UK.
Genus plc, Hendersonville, TN, USA.
Front Genet. 2015 Nov 27;6:338. doi: 10.3389/fgene.2015.00338. eCollection 2015.
Many applications of high-throughput sequencing rely on the availability of an accurate reference genome. Variant calling often produces large data sets that cannot realistically be validated and that may contain large numbers of false positives. Errors in the reference assembly increase the number of false positives. While resources are available to aid in filtering variants from human data, no such resources yet exist for other species, so strict filtering techniques must be employed, and these are more likely to exclude true positives. This work assesses the accuracy of the pig reference genome (Sscrofa10.2) using whole genome sequencing (WGS) reads from the Duroc sow on whose genome the assembly was based. Indicators of structural variation, including high regional coverage, unexpected insert sizes, improper pairing, and homozygous variants, were used to identify low-quality (LQ) regions of the assembly. Low-coverage (LC) regions were also identified and analyzed separately. The LQ regions covered 13.85% of the genome, the LC regions covered 26.6% of the genome, and combined (LQLC) they covered 33.07% of the genome. Over half of dbSNP variants were located in the LQLC regions. Of the copy number variable regions identified in a previous study, 86.3% were located in the LQLC regions. The regions were also enriched for gene predictions from RNA-seq data, with 42.98% falling in the LQLC regions. Excluding variants in the LQ, LC, or LQLC regions from future analyses will help reduce the number of false-positive variant calls. Researchers using WGS data should be aware that the current pig reference genome does not give an accurate representation of the copy number of alleles in the original Duroc sow's genome.
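The recommendation to exclude variants that fall in the LQ, LC, or LQLC regions can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`build_index`, `in_regions`, `filter_variants`) and the coordinates are hypothetical, and the sketch assumes the regions are available as BED-style, 0-based, half-open intervals `(chrom, start, end)` and that variant positions have already been converted to 0-based coordinates.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Merge overlapping (chrom, start, end) intervals per chromosome.

    Intervals are assumed to be 0-based and half-open, as in BED files.
    Returns {chrom: (sorted_starts, matching_ends)} for binary search.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Overlapping or touching the previous interval: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def in_regions(index, chrom, pos):
    """Return True if the 0-based position lies inside any merged interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < ends[i]


def filter_variants(variants, index):
    """Keep only variants (chrom, pos, ...) that fall outside the regions."""
    return [v for v in variants if not in_regions(index, v[0], v[1])]


# Hypothetical coordinates, for illustration only.
flagged = build_index([("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)])
calls = [("chr1", 99), ("chr1", 100), ("chr1", 299),
         ("chr1", 300), ("chr2", 10), ("chr3", 5)]
kept = filter_variants(calls, flagged)
```

VCF positions are 1-based, so a real workflow would subtract 1 from each position before the lookup, or use an established interval tool instead of this sketch. As a side note on the reported figures: if all three percentages refer to the same genome length, then 13.85% + 26.6% = 40.45% against a combined 33.07% implies that the LQ and LC sets overlap on roughly 7.38% of the genome.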