Yi Xueling, Latch Emily K
Behavioral and Molecular Ecology Research Group, Department of Biological Sciences, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA.
Mol Ecol Resour. 2022 Feb;22(2):602-611. doi: 10.1111/1755-0998.13498. Epub 2021 Sep 9.
Population genetic studies in non-model systems increasingly use next-generation sequencing to obtain more loci, but such methods also generate more missing data that may affect downstream analyses. Here we focus on principal component analysis (PCA), which has been widely used to explore and visualize population structure, typically with missing genotypes replaced by the per-locus mean (mean imputation). We simulated data under different population models with various levels of total missingness (1%, 10%, 20%), introduced either randomly or with bias towards particular individuals or populations. We found that individuals carrying a disproportionate share of the missing data were dragged away from their true population clusters towards the origin of the PCA plot, making them indistinguishable from truly admixed individuals and potentially leading to misinterpreted population structure. We also generated empirical data for the big brown bat (Eptesicus fuscus) using restriction site-associated DNA sequencing (RADseq). We filtered the data into three data sets with 19.12%, 9.87%, and 1.35% total missingness; all three showed nonrandom missing data, with the missingness-biased individuals dragged towards the PCA origin, consistent with the simulation results. We highlight the importance of considering missing data effects on PCA in non-model systems, where nonrandom missing data are common because of varying sample quality. To help detect missing data effects, we suggest (1) plotting the PCA with a colour gradient showing per-sample missingness, (2) interpreting samples close to the PCA origin with extra caution, (3) exploring filtering parameters with and without the missingness-biased samples, and (4) using complementary analyses (e.g., model-based methods) to cross-validate PCA results and help interpret population structure.
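The pull towards the origin can be illustrated with a minimal sketch, which is not the authors' simulation design. It assumes two diverged populations with 0/1/2 genotype coding, concentrates missing genotypes in a few individuals, mean-imputes per locus, and runs PCA through a singular value decomposition. All function names, sample sizes, divergence levels, and missingness rates are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_genotypes(rng, n_per_pop=50, n_loci=2000):
    """Simulate 0/1/2 genotypes for two populations (assumed toy model)."""
    p1 = rng.uniform(0.1, 0.9, n_loci)
    # Population 2 frequencies drift away from population 1 (assumed divergence).
    p2 = np.clip(p1 + rng.normal(0.0, 0.2, n_loci), 0.05, 0.95)
    g1 = rng.binomial(2, p1, size=(n_per_pop, n_loci))
    g2 = rng.binomial(2, p2, size=(n_per_pop, n_loci))
    return np.vstack([g1, g2]).astype(float)


def add_biased_missingness(rng, geno, biased_idx,
                           rate_biased=0.8, rate_other=0.01):
    """Set genotypes to NaN, with a much higher rate in the biased individuals."""
    rates = np.full(geno.shape[0], rate_other)
    rates[biased_idx] = rate_biased
    mask = rng.random(geno.shape) < rates[:, None]
    out = geno.copy()
    out[mask] = np.nan
    return out


def mean_impute_pca(geno, n_components=2):
    """Replace NaN with the per-locus mean, centre each locus, and return PC scores."""
    locus_means = np.nanmean(geno, axis=0)
    filled = np.where(np.isnan(geno), locus_means, geno)
    # Imputed entries become exactly zero after centring, which is what
    # shrinks heavily imputed individuals towards the origin.
    centered = filled - filled.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


rng = np.random.default_rng(0)
genotypes = simulate_genotypes(rng)
biased = np.arange(5)  # first five individuals of population 1
with_missing = add_biased_missingness(rng, genotypes, biased)

scores = mean_impute_pca(with_missing)
# Per-sample missingness, usable as a colour gradient on the PCA plot.
missingness = np.isnan(with_missing).mean(axis=1)

abs_pc1 = np.abs(scores[:, 0])
print("mean |PC1|, biased individuals:     ", abs_pc1[biased].mean())
print("mean |PC1|, rest of their population:", abs_pc1[5:50].mean())
```

Under these assumptions the biased individuals should sit much closer to zero on PC1 than the rest of their population, even though their true genotypes come from the same source. Passing `missingness` as the colour argument of a scatter plot of `scores` would correspond to suggestion (1).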