Meisner Jonas, Liu Siyang, Huang Mingxi, Albrechtsen Anders
Department of Biology, University of Copenhagen, Copenhagen DK-2200, Denmark.
BGI-Shenzhen, Shenzhen 518083, China.
Bioinformatics. 2021 Jul 27;37(13):1868-1875. doi: 10.1093/bioinformatics/btab027.
Principal component analysis (PCA) is a commonly used tool in genetics to capture and visualize population structure. Due to technological advances in sequencing, such as the widely used non-invasive prenatal test, massive datasets of ultra-low coverage sequencing are being generated. These datasets are characterized by a large amount of missing genotype information.
We present EMU, a method for inferring population structure in the presence of rampant non-random missingness. We show through simulations that several commonly used PCA methods cannot handle missing data arising from various sources, which leads to biased results because individuals are projected into the PC space according to their amount of missingness. In terms of accuracy, EMU outperforms an existing method that also accommodates missingness, while being competitively fast. We further tested EMU on around 100K individuals from the Phase 1 dataset of the Chinese Millionome Project, who were shallowly sequenced to around 0.08×. From these data we are able to capture the population structure of the Han Chinese and to reproduce previous analysis in a matter of CPU hours instead of CPU years. EMU's capability to accurately infer population structure in the presence of missingness will be of increasing importance with the rising number of large-scale genetic datasets.
EMU is written in Python and is freely available at https://github.com/rosemeis/emu.
Supplementary data are available at Bioinformatics online.
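The abstract does not spell out the algorithm, so the following is only a minimal sketch of one generic way to run PCA on genotypes with missing entries: iterative low-rank (SVD) imputation in the style of EM-PCA. It is not EMU's implementation. The function name `em_pca`, its parameters, and the simulated two-population data are all assumptions made for illustration.

```python
import numpy as np

def em_pca(G, n_components=2, n_iter=50, tol=1e-6):
    """Illustrative EM-style PCA via iterative SVD imputation (not EMU itself).

    G: (n_individuals, n_sites) genotypes in {0, 1, 2}, with NaN for missing.
    Returns (scores, filled): PC scores and the imputed genotype matrix.
    """
    G = np.asarray(G, dtype=float)
    missing = np.isnan(G)

    # Initialise missing entries with the per-site mean of observed genotypes.
    obs_count = (~missing).sum(axis=0)
    obs_sum = np.where(missing, 0.0, G).sum(axis=0)
    site_mean = obs_sum / np.maximum(obs_count, 1)
    filled = np.where(missing, site_mean, G)

    for _ in range(n_iter):
        # Fit a rank-k model to the currently filled matrix.
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        low_rank = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu

        # Re-impute only the missing entries; observed genotypes stay fixed.
        new = np.where(missing, np.clip(low_rank, 0.0, 2.0), G)
        change = np.sqrt(np.mean((new - filled) ** 2))
        filled = new
        if change < tol:
            break

    # Final PCA on the imputed matrix.
    mu = filled.mean(axis=0)
    U, s, _ = np.linalg.svd(filled - mu, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    return scores, filled


# Hypothetical toy data: two populations with slightly different allele frequencies.
rng = np.random.default_rng(0)
n, m = 60, 500
freq_a = rng.uniform(0.1, 0.9, m)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, m), 0.05, 0.95)
pop = np.repeat([0, 1], n // 2)
P = np.where(pop[:, None] == 0, freq_a, freq_b)
G_true = rng.binomial(2, P).astype(float)

# Non-random missingness: each individual has its own missing rate.
miss_rate = rng.uniform(0.2, 0.8, n)
mask = rng.random((n, m)) < miss_rate[:, None]
G_obs = G_true.copy()
G_obs[mask] = np.nan

scores, filled = em_pca(G_obs, n_components=2)
```

Filling missing genotypes with a fixed value and running ordinary PCA tends to pull individuals with more missing data toward the origin, which is the kind of missingness-driven bias the abstract describes. Re-imputing from the low-rank fit at each iteration is one common way to reduce that effect. The toy data here are far smaller and far less sparse than the 0.08× data discussed above.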