Marine Genomics Laboratory, Department of Life Sciences, Texas A&M University-Corpus Christi, 6300 Ocean Drive, Corpus Christi, TX, 78412, USA.
Marine Science Center, Northeastern University, 430 Nahant RD, Nahant, MA, 01908, USA.
Mol Ecol Resour. 2017 Sep;17(5):955-965. doi: 10.1111/1755-0998.12647. Epub 2017 Feb 9.
Next-generation sequencing of reduced-representation genomic libraries provides a powerful methodology for genotyping thousands of single-nucleotide polymorphisms (SNPs) among individuals of nonmodel species. Utilizing genotype data in the absence of a reference genome, however, presents a number of challenges. One major challenge is the trade-off between splitting alleles at a single locus into separate clusters (loci), creating inflated homozygosity, and lumping multiple loci into a single contig (locus), creating artefacts and inflated heterozygosity. This issue has been addressed primarily through the use of similarity cut-offs in sequence clustering. Here, two commonly employed, postclustering filtering methods (read depth and excess heterozygosity) used to identify incorrectly assembled loci are compared with haplotyping, another postclustering filtering approach. Simulated and empirical data sets were used to demonstrate that each of the three methods separately identified incorrectly assembled loci; more optimal results were achieved when the three methods were applied in combination. The results confirmed that including incorrectly assembled loci in population-genetic data sets inflates estimates of heterozygosity and deflates estimates of population divergence. Additionally, at low levels of population divergence, physical linkage between SNPs within a locus created artificial clustering in analyses that assume markers are independent. Haplotyping SNPs within a locus effectively neutralized the physical linkage issue without having to thin data to a single SNP per locus. We introduce a Perl script that haplotypes polymorphisms, using data from single or paired-end reads, and identifies potentially problematic loci.
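The haplotype-based filter described above can be illustrated with a minimal sketch. It is not the authors' Perl script. It assumes that reads are already grouped by locus and individual, that SNP positions are given as 0-based offsets into each read, and that a diploid individual showing more than two well-supported haplotypes at a locus suggests several loci were lumped into one contig. The function names, the `min_reads` support threshold, and the toy data are hypothetical.

```python
from collections import Counter


def observed_haplotypes(reads, snp_positions, min_reads=2):
    """Return haplotypes (bases at the SNP positions) seen in at least min_reads reads.

    min_reads is an assumed support threshold meant to discount
    haplotypes produced by sequencing error.
    """
    counts = Counter(
        "".join(read[p] for p in snp_positions)
        for read in reads
        if all(p < len(read) for p in snp_positions)  # skip reads too short to cover every SNP
    )
    return {hap for hap, n in counts.items() if n >= min_reads}


def flag_locus(reads_by_individual, snp_positions, ploidy=2, min_reads=2):
    """Return the individuals that show more haplotypes than their ploidy allows."""
    return sorted(
        ind
        for ind, reads in reads_by_individual.items()
        if len(observed_haplotypes(reads, snp_positions, min_reads)) > ploidy
    )


# Toy locus with SNPs at offsets 1 and 3 (made-up reads, for illustration only).
locus = {
    "ind1": ["ACGTA", "ACGTA", "ATGCA", "ATGCA"],                    # haplotypes CT, TC
    "ind2": ["ACGTA", "ACGTA", "ATGCA", "ATGCA", "ACGCA", "ACGCA"],  # haplotypes CT, TC, CC
}
print(flag_locus(locus, [1, 3]))  # ['ind2']
```

A locus flagged in many individuals would be a candidate for removal. How many flagged individuals to tolerate is a further assumption that this sketch leaves open.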