Department of Ecology and Evolutionary Biology, La Kretz Center for California Conservation Science, and Institute of the Environment and Sustainability, University of California, Los Angeles, California.
Faculty of Arts and Science, Department of Biochemistry, Kütahya Dumlupınar University, Kütahya, Turkey.
Mol Ecol Resour. 2019 Sep;19(5):1195-1204. doi: 10.1111/1755-0998.13029. Epub 2019 Jun 1.
Genomic data are increasingly used for high-resolution population genetic studies, including those at the forefront of biological conservation. A key methodological challenge is determining sequence similarity clustering thresholds for RADseq data when no reference genome is available. These thresholds define the maximum permitted divergence among allelic variants and the minimum divergence among putative paralogues, and they are central to downstream population genomic analyses. Here we develop a novel set of metrics to determine sequence similarity thresholds that maximize the correct separation of paralogous regions and minimize oversplitting of naturally occurring allelic variation within loci. These metrics empirically identify the threshold at which true alleles at opposite ends of several major axes of genetic variation begin to incorrectly separate into distinct clusters, allowing researchers to choose thresholds just below this value. We test our approach on a recently published data set for the protected foothill yellow-legged frog (Rana boylii). The metrics recover a consistent pattern of roughly 96% similarity as a threshold above which genetic divergence and data missingness become increasingly correlated. We provide scripts for assessing different clustering thresholds and discuss how this approach can be applied across a wide range of empirical data sets.
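One of the signals described above, the growing correlation between genetic divergence and data missingness at overly strict thresholds, can be illustrated with a minimal Python sketch. This is not the authors' published script. The input layout (one list of genotype calls per sample, coded 0/1/2 with None for missing), the function names, and the specific definitions of pairwise divergence and pairwise missingness are all assumptions made for illustration.

```python
import math
from itertools import combinations


def pairwise_divergence(a, b):
    """Mean allele-count difference (scaled to 0..1) over loci called in both samples.

    Assumed definition for illustration; returns None if no locus is shared.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return sum(abs(x - y) for x, y in shared) / (2 * len(shared))


def pairwise_missingness(a, b):
    """Fraction of loci missing in at least one of the two samples (assumed definition)."""
    missing = sum(1 for x, y in zip(a, b) if x is None or y is None)
    return missing / len(a)


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either variable has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def missingness_divergence_correlation(genotypes):
    """Correlate pairwise missingness with pairwise divergence across all sample pairs."""
    miss, div = [], []
    for a, b in combinations(genotypes, 2):
        d = pairwise_divergence(a, b)
        if d is None:  # pair shares no called loci, so divergence is undefined
            continue
        miss.append(pairwise_missingness(a, b))
        div.append(d)
    if len(miss) < 2:
        return math.nan
    return pearson(miss, div)


def scan_thresholds(genotypes_by_threshold):
    """Map each clustering threshold to its missingness-divergence correlation.

    `genotypes_by_threshold` is a hypothetical dict: threshold -> genotype matrix
    produced by assembling the same reads at that threshold.
    """
    return {
        t: missingness_divergence_correlation(g)
        for t, g in sorted(genotypes_by_threshold.items())
    }
```

Under these assumptions, one would assemble the same reads at a series of thresholds, pass the resulting genotype matrices to `scan_thresholds`, and look for the threshold above which the correlation starts to climb. Because pairwise values are not independent observations, the coefficient is best read as a descriptive trend across thresholds rather than as a significance test.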