Nonrandom missing data can bias Principal Component Analysis inference of population genetic structure.

Authors

Yi Xueling, Latch Emily K

Affiliation

Behavioral and Molecular Ecology Research Group, Department of Biological Sciences, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA.

Publication information

Mol Ecol Resour. 2022 Feb;22(2):602-611. doi: 10.1111/1755-0998.13498. Epub 2021 Sep 9.

Abstract

Population genetic studies in non-model systems increasingly use next-generation sequencing to obtain more loci, but such methods also generate more missing data that may affect downstream analyses. Here we focus on principal component analysis (PCA), which has been widely used to explore and visualize population structure with mean-imputed missing data. We simulated data under different population models with various levels of total missingness (1%, 10%, 20%), introduced either randomly or with bias among individuals or populations. We found that individuals with disproportionately high missingness were dragged away from their real population clusters towards the origin of PCA plots, making them indistinguishable from truly admixed individuals and potentially leading to misinterpreted population structure. We also generated empirical data for the big brown bat (Eptesicus fuscus) using restriction site-associated DNA sequencing (RADseq). We filtered three data sets with 19.12%, 9.87%, and 1.35% total missingness, all showing nonrandom missing data with biased individuals dragged towards the PCA origin, consistent with results from simulations. We highlight the importance of considering missing data effects on PCA in non-model systems where nonrandom missing data are common due to varying sample quality. To help detect missing data effects, we suggest (1) plotting PCA with a colour gradient showing per-sample missingness, (2) interpreting samples close to the PCA origin with extra caution, (3) exploring filtering parameters with and without the missingness-biased samples, and (4) using complementary analyses (e.g., model-based methods) to cross-validate PCA results and help interpret population structure.
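The pull towards the origin follows from mean imputation itself: an imputed genotype equals the locus mean, so after centering it contributes exactly zero, and an individual with many imputed loci has a shrunken score on every axis. The Python sketch below illustrates this on toy data. It is not the authors' simulation pipeline: the two populations, the allele frequencies (0.8 and 0.2), the sample sizes, the 80% missingness in one individual, and all function names are assumptions chosen for illustration, and PC1 is obtained by simple power iteration rather than a full PCA routine.

```python
import random


def simulate_genotypes(n_per_pop=20, n_loci=300, seed=0):
    """Diploid genotypes (0/1/2) for two toy populations (assumed freqs 0.8, 0.2)."""
    rng = random.Random(seed)
    geno = []
    for freq in (0.8, 0.2):
        for _ in range(n_per_pop):
            # Each genotype is the sum of two Bernoulli allele draws.
            geno.append([(rng.random() < freq) + (rng.random() < freq)
                         for _ in range(n_loci)])
    return geno, rng


def mask_individual(geno, index, fraction, rng):
    """Set a fraction of one individual's loci to None (missing)."""
    n_loci = len(geno[index])
    for j in rng.sample(range(n_loci), round(fraction * n_loci)):
        geno[index][j] = None


def mean_impute_and_center(geno):
    """Replace missing calls by the locus mean, then subtract that mean."""
    n, n_loci = len(geno), len(geno[0])
    centered = [[0.0] * n_loci for _ in range(n)]
    for j in range(n_loci):
        observed = [row[j] for row in geno if row[j] is not None]
        mean = sum(observed) / len(observed)
        for i in range(n):
            value = mean if geno[i][j] is None else geno[i][j]
            centered[i][j] = value - mean  # imputed entries become 0
    return centered


def pc1_scores(centered, rng, n_iter=100):
    """PC1 scores via power iteration on the individuals-by-individuals Gram matrix."""
    n = len(centered)
    gram = [[sum(a * b for a, b in zip(centered[i], centered[k]))
             for k in range(n)] for i in range(n)]
    u = [rng.random() - 0.5 for _ in range(n)]
    for _ in range(n_iter):
        u = [sum(g * x for g, x in zip(row, u)) for row in gram]
        norm = sum(x * x for x in u) ** 0.5
        u = [x / norm for x in u]
    eigenvalue = sum(u[i] * sum(g * x for g, x in zip(gram[i], u))
                     for i in range(n))
    # Scores are the leading eigenvector scaled by the singular value.
    return [x * eigenvalue ** 0.5 for x in u]


geno, rng = simulate_genotypes()
mask_individual(geno, index=0, fraction=0.8, rng=rng)  # individual 0 is biased

# Per-sample missingness: the quantity suggested for the colour gradient.
missingness = [sum(g is None for g in row) / len(row) for row in geno]
scores = pc1_scores(mean_impute_and_center(geno), rng)

unbiased_mean = sum(abs(s) for s in scores[1:20]) / 19
print(f"biased individual: missingness={missingness[0]:.2f}, |PC1|={abs(scores[0]):.2f}")
print(f"same-population mean |PC1| (unbiased): {unbiased_mean:.2f}")
```

In this toy setting the high-missingness individual should sit much closer to zero on PC1 than the other members of its population, which is where a truly admixed individual would also fall. The `missingness` list could be passed as the colour argument of a scatter plot in any plotting library to draw the gradient described in suggestion (1).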
