Department of Computer Science, Aalto University, Espoo 02150, Finland.
Institute for Molecular Medicine Finland, FIMM, HiLIFE, University of Helsinki, Helsinki 00014, Finland.
Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad535.
Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking.
We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. Compared with alternative methods, HAPNEST runs faster and produces synthetic samples with a lower degree of relatedness to the reference panels, while generating datasets that preserve key statistical properties of real data. These properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses by comparing seven methods for generating polygenic risk scores across multiple ancestry groups and different genetic architectures.
A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.