Department of Biological Sciences, Purdue University, West Lafayette, IN, USA.
Flathead Lake Biological Station, Wildlife Biology Program and Division of Biological Sciences, University of Montana, Missoula, MT, USA.
Nat Rev Genet. 2024 Nov;25(11):750-767. doi: 10.1038/s41576-024-00738-6. Epub 2024 Jun 14.
Genomic data are ubiquitous across disciplines, from agriculture to biodiversity, ecology, evolution and human health. However, these datasets often contain noise or errors and are missing information that can affect the accuracy and reliability of subsequent computational analyses and conclusions. A key step in genomic data analysis is filtering (removing sequencing bases, reads, genetic variants and/or individuals from a dataset) to improve data quality for downstream analyses. Researchers are confronted with a multitude of choices when filtering genomic data; they must choose which filters to apply and select appropriate thresholds. To help usher in the next generation of genomic data filtering, we review and suggest best practices to improve the implementation, reproducibility and reporting standards for filter types and thresholds commonly applied to genomic datasets. We focus mainly on filters for minor allele frequency, missing data per individual or per locus, linkage disequilibrium and Hardy-Weinberg deviations. Using simulated and empirical datasets, we illustrate the large effects of different filtering thresholds on common population genetics statistics, such as Tajima's D value, population differentiation (FST), nucleotide diversity (π) and effective population size (Ne).
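As a concrete illustration of two of the filters named above (minor allele frequency and missing data per locus), the following is a minimal sketch and not the authors' implementation. It assumes diploid, biallelic loci coded as 0, 1 or 2 copies of the alternate allele, with None marking a missing call. The function names, the toy genotype matrix and the thresholds are hypothetical and chosen only for illustration; the abstract does not prescribe specific values.

```python
from typing import List, Optional

# Assumed coding: 0, 1 or 2 copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def locus_missingness(calls: List[Genotype]) -> float:
    """Fraction of individuals with no genotype call at one locus."""
    return sum(g is None for g in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """MAF at one diploid biallelic locus, using non-missing calls only."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0  # no data at all: treat as monomorphic
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)


def filter_loci(
    genotypes: List[List[Genotype]],
    min_maf: float,
    max_missing: float,
) -> List[int]:
    """Return indices of loci that pass both filters.

    genotypes[i][j] is the call for individual i at locus j.
    """
    kept = []
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        if (
            locus_missingness(column) <= max_missing
            and minor_allele_frequency(column) >= min_maf
        ):
            kept.append(j)
    return kept


# Toy data (hypothetical): 4 individuals x 3 loci.
genotypes = [
    [0, 1, None],
    [0, 2, None],
    [0, 1, 1],
    [1, 0, 2],
]

# Illustrative thresholds only; changing them changes which loci are kept.
strict = filter_loci(genotypes, min_maf=0.2, max_missing=0.25)
lenient = filter_loci(genotypes, min_maf=0.1, max_missing=0.25)
print(strict, lenient)
```

In this toy example the stricter minor allele frequency threshold drops the first locus while the more lenient one keeps it, which is the kind of threshold sensitivity the abstract says can carry through to downstream statistics.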