Hawkesbury Institute for the Environment, Western Sydney University, Richmond, NSW, Australia.
CSIRO Land & Water, Hobart, Tas., Australia.
Mol Ecol Resour. 2021 Jul;21(5):1460-1474. doi: 10.1111/1755-0998.13351. Epub 2021 Mar 9.
Genotype-environment association (GEA) methods have become part of the standard landscape genomics toolkit, yet we know little about how best to filter genotype-by-sequencing data to provide robust inferences of environmental adaptation. In many cases, default filtering thresholds for minor allele frequency and missing data are applied regardless of sample size, with unknown impacts on the results and potentially negative consequences for management strategies. Here, we investigate the effects of filtering on GEA results and the potential implications for assessing adaptation to the environment. We use empirical and simulated data sets derived from two widespread tree species to assess the effects of filtering on GEA outputs. Critically, we find that the level of filtering applied to missing data and minor allele frequency affects the identification of true positives. Even slight adjustments to these thresholds can change the rate of true positive detection. Using conservative thresholds for missing data and minor allele frequency substantially reduces the size of the data set, lessening the power to detect adaptive variants (i.e., simulated true positives) under both strong and weak selection. Regardless, strength of selection was a good predictor of GEA detection, but even some SNPs under strong selection went undetected. False positive rates varied depending on the species and GEA method, and filtering significantly affected the predictions of adaptive capacity in downstream analyses. We make several recommendations regarding filtering for GEA methods. Ultimately, there is no filtering panacea, but some choices are better than others, depending on the study system, availability of genomic resources, and desired objectives.
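To make the two filters discussed above concrete, the following is a minimal sketch of per-SNP filtering on minor allele frequency and missing data. It is not the pipeline used in the study. The genotype encoding (0, 1, 2 alternate-allele counts for diploid individuals, with -1 for a missing call), the function name `snp_filter_mask`, and the default thresholds of 0.05 and 0.2 are all assumptions chosen for illustration, not recommendations from the paper.

```python
import numpy as np

MISSING = -1  # assumed code for a missing genotype call (hypothetical)


def snp_filter_mask(genotypes, min_maf=0.05, max_missing=0.2):
    """Return a boolean mask of SNPs (columns) that pass both filters.

    genotypes: individuals x SNPs array of alternate-allele counts
    (0, 1 or 2), with MISSING marking an absent call. The default
    thresholds are illustrative only.
    """
    g = np.asarray(genotypes)
    called = g != MISSING

    # Proportion of individuals with no call at each SNP.
    n_called = called.sum(axis=0)
    missing_rate = 1.0 - n_called / g.shape[0]

    # Alternate-allele frequency among called diploid genotypes.
    alt_count = np.where(called, g, 0).sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        alt_freq = alt_count / (2.0 * n_called)

    # Minor allele frequency; SNPs with no calls at all get MAF 0.
    maf = np.nan_to_num(np.minimum(alt_freq, 1.0 - alt_freq), nan=0.0)

    return (missing_rate <= max_missing) & (maf >= min_maf)


# Toy data: 5 individuals x 4 SNPs.
genotypes = np.array([
    [0, 1, 0, -1],
    [0, 2, 0, -1],
    [1, 1, 0,  2],
    [0, 0, 0, -1],
    [2, 1, 0,  0],
])

strict = snp_filter_mask(genotypes, min_maf=0.05, max_missing=0.2)
relaxed = snp_filter_mask(genotypes, min_maf=0.05, max_missing=0.7)
print(int(strict.sum()), "SNPs retained under the strict missing-data threshold")
print(int(relaxed.sum()), "SNPs retained under the relaxed missing-data threshold")
```

In this toy matrix the fourth SNP is missing in three of five individuals, so whether it is retained depends entirely on the missing-data threshold. This mirrors, on a very small scale, the abstract's point that modest changes to thresholds change which loci are available to a GEA method.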