Money Daniel, Migicovsky Zoë, Gardner Kyle, Myles Sean
Department of Plant and Animal Sciences, Faculty of Agriculture, Dalhousie University, Truro, Nova Scotia, Canada.
BMC Genomics. 2017 Jul 10;18(1):523. doi: 10.1186/s12864-017-3873-5.
Genomic studies such as genome-wide association and genomic selection require genome-wide genotype data. All existing technologies used to create these data result in missing genotypes, which are often then inferred using genotype imputation software. However, existing imputation methods most often make use only of genotypes that are successfully inferred after having passed a certain read depth threshold. Because of this, any read information for genotypes that did not pass the threshold, and were thus set to missing, is ignored. Most genomic studies also choose read depth thresholds and quality filters without investigating their effects on the size and quality of the resulting genotype data. Moreover, almost all genotype imputation methods require ordered markers and are therefore of limited utility in non-model organisms.
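The effect of a read depth threshold on the amount of missing data can be illustrated with a small sketch. This is not LinkImputeR code: the toy read counts, the naive caller and the function names `call_with_threshold` and `summarize` are assumptions made purely for illustration. It shows how read information is discarded when a genotype falls below the threshold and is set to missing.

```python
# Hypothetical toy data: one row per sample, one (ref_reads, alt_reads) pair per marker.
counts = [
    [(5, 0), (1, 1), (0, 0)],
    [(0, 9), (2, 0), (4, 5)],
    [(3, 3), (0, 1), (0, 12)],
]


def call_with_threshold(counts, min_depth):
    """Naive caller (illustrative only): alt-allele dosage 0/1/2, or None if depth < min_depth."""
    calls = []
    for row in counts:
        called_row = []
        for ref, alt in row:
            depth = ref + alt
            if depth == 0 or depth < min_depth:
                called_row.append(None)  # reads, if any, are thrown away here
            elif alt == 0:
                called_row.append(0)
            elif ref == 0:
                called_row.append(2)
            else:
                called_row.append(1)
        calls.append(called_row)
    return calls


def summarize(calls):
    """Return (number of called genotypes, number of missing genotypes)."""
    total = sum(len(row) for row in calls)
    called = sum(1 for row in calls for g in row if g is not None)
    return called, total - called


for min_depth in (1, 2, 4, 8):
    called, missing = summarize(call_with_threshold(counts, min_depth))
    print(f"min_depth={min_depth}: called={called}, missing={missing}")
```

Raising `min_depth` in this toy example steadily converts called genotypes into missing ones, even though most of the discarded entries still carry some read evidence.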
Here we introduce LinkImputeR, a software program that exploits the read count information that is normally ignored and makes use of all available DNA sequence information for genotype calling and imputation. It is specifically designed for non-model organisms, since it requires neither ordered markers nor a reference panel of genotypes. Using next-generation DNA sequence (NGS) data from apple, cannabis and grape, we quantify the effect of varying read count and missingness thresholds on the quantity and quality of the genotypes generated by LinkImputeR. We demonstrate that LinkImputeR can increase the number of genotype calls by more than an order of magnitude and can improve genotyping accuracy by several percent, and can thus improve the power of downstream analyses. Moreover, we show that the effects of quality and read depth filters can differ substantially between data sets and should therefore be investigated on a per-study basis.
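One common way to use read counts directly, rather than discarding low-depth genotypes, is to turn them into genotype probabilities. The sketch below uses a simple binomial read model with a flat prior. The abstract does not describe LinkImputeR's actual model, so the model, the `error` parameter and the function names here are assumptions for illustration only.

```python
from math import comb


def genotype_likelihoods(ref, alt, error=0.01):
    """P(read counts | genotype) for alt-allele dosage 0, 1, 2 under a binomial read model.

    `error` is an assumed per-read sequencing error rate (hypothetical value).
    """
    n = ref + alt
    # Probability that a single read shows the alt allele, given the genotype.
    p_alt = {0: error, 1: 0.5, 2: 1.0 - error}
    return {
        g: comb(n, alt) * (p ** alt) * ((1.0 - p) ** ref)
        for g, p in p_alt.items()
    }


def genotype_posteriors(ref, alt, error=0.01):
    """Normalize the likelihoods under a flat prior over the three genotypes."""
    lik = genotype_likelihoods(ref, alt, error)
    total = sum(lik.values())
    return {g: value / total for g, value in lik.items()}


# A single alt read is weak evidence, but it is not zero evidence.
print(genotype_posteriors(0, 1))
# Twelve alt reads and no ref reads strongly favor the homozygous alt genotype.
print(genotype_posteriors(0, 12))
```

Under these assumptions a genotype with a single read still shifts probability toward one genotype, which is the kind of information lost when such a genotype is simply set to missing.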
By exploiting DNA sequence data that is normally ignored during genotype calling and imputation, LinkImputeR can significantly improve both the quantity and quality of genotype data generated from NGS technologies. It enables the user to quickly and easily examine the effects of varying thresholds and filters on the number and quality of the resulting genotype calls. In this manner, users can decide on thresholds that are most suitable for their purposes. We show that LinkImputeR can significantly augment the value and utility of NGS data sets, especially in non-model organisms with poor genomic resources.