Large-scale inference of population structure in presence of missingness using PCA.

Authors

Meisner Jonas, Liu Siyang, Huang Mingxi, Albrechtsen Anders

Affiliations

Department of Biology, University of Copenhagen, Copenhagen DK-2200, Denmark.

BGI-Shenzhen, Shenzhen 518083, China.

Publication information

Bioinformatics. 2021 Jul 27;37(13):1868-1875. doi: 10.1093/bioinformatics/btab027.

DOI: 10.1093/bioinformatics/btab027
PMID: 33459779
Abstract

MOTIVATION

Principal component analysis (PCA) is a commonly used tool in genetics to capture and visualize population structure. Due to technological advances in sequencing, such as the widely used non-invasive prenatal test, massive datasets of ultra-low coverage sequencing are being generated. These datasets are characterized by having a large amount of missing genotype information.

RESULTS

We present EMU, a method for inferring population structure in the presence of rampant non-random missingness. We show through simulations that several commonly used PCA methods cannot handle missing data arising from various sources, which leads to biased results as individuals are projected into the PC space based on their amount of missingness. In terms of accuracy, EMU outperforms an existing method that also accommodates missingness while being competitively fast. We further tested EMU on around 100K individuals of the Phase 1 dataset of the Chinese Millionome Project, who were shallowly sequenced to around 0.08×. From these data we are able to capture the population structure of the Han Chinese and to reproduce previous analysis in a matter of CPU hours instead of CPU years. EMU's capability to accurately infer population structure in the presence of missingness will be of increasing importance with the rising number of large-scale genetic datasets.
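
The bias described in the results, where individuals are placed in PC space according to how much data they are missing, can be illustrated with a small simulation. The sketch below is not taken from the EMU code base and does not claim to reproduce its algorithm. It only contrasts per-site mean imputation with a generic iterative low-rank (truncated SVD) imputation on simulated genotypes. All function names, the two-population model, and the 80% versus 5% missingness rates are assumptions made for illustration.

```python
import numpy as np


def simulate(n_per_pop=100, n_sites=2000, seed=0):
    """Simulate two populations plus a non-random missingness pattern (toy model)."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.1, 0.9, n_sites)
    shift = rng.normal(0.0, 0.1, n_sites)
    # Population-specific allele frequencies, kept away from 0 and 1.
    freqs = np.clip(np.stack([base - shift, base + shift]), 0.02, 0.98)
    labels = np.repeat([0, 1], n_per_pop)
    G = rng.binomial(2, freqs[labels]).astype(float)
    # Assumption: every second individual loses about 80% of genotypes, the rest 5%.
    high = np.arange(2 * n_per_pop) % 2 == 0
    miss_rate = np.where(high, 0.8, 0.05)
    observed = rng.random(G.shape) >= miss_rate[:, None]
    return G, observed, high


def mean_impute(G, observed):
    """Fill missing genotypes with the per-site mean of the observed genotypes."""
    site_mean = (G * observed).sum(0) / np.maximum(observed.sum(0), 1)
    return np.where(observed, G, site_mean)


def iterative_impute(G, observed, k=1, n_iter=30):
    """Refill missing entries repeatedly from a rank-k SVD reconstruction."""
    X = mean_impute(G, observed)
    for _ in range(n_iter):
        mu = X.mean(0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k] + mu
        # Observed genotypes are never overwritten; only the gaps are updated.
        X = np.where(observed, G, np.clip(low_rank, 0.0, 2.0))
    return X


def pc1_scores(X):
    """Scores of each individual on the first principal component."""
    U, s, _ = np.linalg.svd(X - X.mean(0), full_matrices=False)
    return U[:, 0] * s[0]


def shrinkage_ratio(scores, high):
    """Mean |PC1| of high-missingness individuals relative to the others."""
    return np.abs(scores[high]).mean() / np.abs(scores[~high]).mean()


if __name__ == "__main__":
    G, observed, high = simulate()
    naive = shrinkage_ratio(pc1_scores(mean_impute(G, observed)), high)
    iterative = shrinkage_ratio(pc1_scores(iterative_impute(G, observed)), high)
    print(f"mean imputation ratio:      {naive:.2f}")
    print(f"iterative imputation ratio: {iterative:.2f}")
```

A ratio well below 1 means that individuals with many missing genotypes are pulled toward the origin of the PC plot regardless of their ancestry, which is the artifact the abstract warns about. A ratio near 1 means their position no longer depends on how much data they are missing. The rank `k` and the iteration count are tuning choices in this toy setting, not recommendations.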

AVAILABILITY AND IMPLEMENTATION

EMU is written in Python and is freely available at https://github.com/rosemeis/emu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

1
Large-scale inference of population structure in presence of missingness using PCA.
Bioinformatics. 2021 Jul 27;37(13):1868-1875. doi: 10.1093/bioinformatics/btab027.
2
Nonrandom missing data can bias Principal Component Analysis inference of population genetic structure.
Mol Ecol Resour. 2022 Feb;22(2):602-611. doi: 10.1111/1755-0998.13498. Epub 2021 Sep 9.
3
Efficient toolkit implementing best practices for principal component analysis of population genetic data.
Bioinformatics. 2020 Aug 15;36(16):4449-4457. doi: 10.1093/bioinformatics/btaa520.
4
Bayesian integrative model for multi-omics data with missingness.
Bioinformatics. 2018 Nov 15;34(22):3801-3808. doi: 10.1093/bioinformatics/bty775.
5
Fast and robust ancestry prediction using principal component analysis.
Bioinformatics. 2020 Jun 1;36(11):3439-3446. doi: 10.1093/bioinformatics/btaa152.
6
RobustClone: a robust PCA method for tumor clone and evolution inference from single-cell sequencing data.
Bioinformatics. 2020 Jun 1;36(11):3299-3306. doi: 10.1093/bioinformatics/btaa172.
7
FlashPCA2: principal component analysis of Biobank-scale genotype datasets.
Bioinformatics. 2017 Sep 1;33(17):2776-2778. doi: 10.1093/bioinformatics/btx299.
8
Inference of gene regulatory networks based on nonlinear ordinary differential equations.
Bioinformatics. 2020 Dec 8;36(19):4885-4893. doi: 10.1093/bioinformatics/btaa032.
9
TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes.
Bioinformatics. 2019 Oct 1;35(19):3679-3683. doi: 10.1093/bioinformatics/btz157.
10
GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis.
G3 (Bethesda). 2019 Aug 8;9(8):2447-2461. doi: 10.1534/g3.118.200925.

Cited by

1
A genealogy-based approach for revealing ancestry-specific structures in admixed populations.
Am J Hum Genet. 2025 Jul 17. doi: 10.1016/j.ajhg.2025.06.016.
2
Metagenomic biodiversity assessment within an offshore wind farm.
Sci Rep. 2025 May 14;15(1):16786. doi: 10.1038/s41598-025-01541-x.
3
A genealogy-based approach for revealing ancestry-specific structures in admixed populations.
bioRxiv. 2025 Jan 14:2025.01.10.632475. doi: 10.1101/2025.01.10.632475.
4
DORA: an interactive map for the visualization and analysis of ancient human DNA and associated data.
Nucleic Acids Res. 2024 Jul 5;52(W1):W54-W60. doi: 10.1093/nar/gkae373.
5
The origins and diversification of Holarctic brown bear populations inferred from genomes of past and present populations.
Proc Biol Sci. 2024 Jan 31;291(2015):20232411. doi: 10.1098/rspb.2023.2411. Epub 2024 Jan 24.
6
Genome-wide variation in the Angolan Namib Desert reveals unique pre-Bantu ancestry.
Sci Adv. 2023 Sep 22;9(38):eadh3822. doi: 10.1126/sciadv.adh3822.
7
Fast and accurate out-of-core PCA framework for large scale biobank data.
Genome Res. 2023 Sep;33(9):1599-1608. doi: 10.1101/gr.277525.122. Epub 2023 Aug 24.
8
Estimating human mobility in Holocene Western Eurasia with large-scale ancient genomic data.
Proc Natl Acad Sci U S A. 2023 Feb 28;120(9):e2218375120. doi: 10.1073/pnas.2218375120. Epub 2023 Feb 23.
9
The Genetic Population Structure of Lake Tanganyika's Lates Species Flock, an Endemic Radiation of Pelagic Top Predators.
J Hered. 2022 May 16;113(2):145-159. doi: 10.1093/jhered/esab072.
10
A geometric relationship of F2, F3 and F4-statistics with principal component analysis.
Philos Trans R Soc Lond B Biol Sci. 2022 Jun 6;377(1852):20200413. doi: 10.1098/rstb.2020.0413. Epub 2022 Apr 18.