HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes.

Affiliations

Department of Computer Science, Aalto University, Espoo 02150, Finland.

Institute for Molecular Medicine Finland, FIMM, HiLIFE, University of Helsinki, Helsinki 00014, Finland.

Publication information

Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad535.

Abstract

MOTIVATION

Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking.

RESULTS

We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses by comparing seven polygenic risk scoring methods across multiple ancestry groups and different genetic architectures.
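To make "phenotypes with varying degrees of heritability and polygenicity" concrete, the following is a minimal sketch of a generic additive phenotype model. It is not HAPNEST's algorithm. The function name `simulate_phenotype`, its parameters, and the toy genotype data are assumptions made for illustration only. Here polygenicity is taken to be the fraction of variants with a nonzero effect, and heritability (`h2`) the share of phenotypic variance explained by the genetic component.

```python
import random
import statistics


def simulate_phenotype(genotypes, h2, polygenicity, seed=0):
    """Simulate one quantitative trait under a simple additive model.

    genotypes    : list of rows, one per individual, of allele counts (0, 1, 2)
    h2           : target heritability, between 0 and 1
    polygenicity : fraction of variants that are causal, in (0, 1]
    Returns (phenotype, genetic_component). Hypothetical helper, not HAPNEST code.
    """
    if not 0.0 <= h2 <= 1.0:
        raise ValueError("h2 must lie in [0, 1]")
    if not 0.0 < polygenicity <= 1.0:
        raise ValueError("polygenicity must lie in (0, 1]")

    rng = random.Random(seed)
    n_variants = len(genotypes[0])

    # Choose the causal variants and draw one effect size for each.
    n_causal = max(1, round(polygenicity * n_variants))
    causal = rng.sample(range(n_variants), n_causal)
    effects = {j: rng.gauss(0.0, 1.0) for j in causal}

    # Raw genetic value: weighted sum of allele counts at causal variants.
    raw = [sum(effects[j] * row[j] for j in causal) for row in genotypes]

    # Rescale so the genetic component has variance exactly h2.
    mean_raw = statistics.fmean(raw)
    sd_raw = statistics.pstdev(raw)
    if sd_raw == 0.0:
        raise ValueError("causal variants show no variation in this sample")
    genetic = [(g - mean_raw) / sd_raw * h2 ** 0.5 for g in raw]

    # Environmental noise with expected variance 1 - h2.
    noise_sd = (1.0 - h2) ** 0.5
    phenotype = [g + rng.gauss(0.0, noise_sd) for g in genetic]
    return phenotype, genetic


# Toy data: 200 individuals, 50 variants, allele frequency 0.3 (all assumed).
data_rng = random.Random(42)
genotypes = [
    [(data_rng.random() < 0.3) + (data_rng.random() < 0.3) for _ in range(50)]
    for _ in range(200)
]
phenotype, genetic = simulate_phenotype(genotypes, h2=0.5, polygenicity=0.2, seed=1)
print(round(statistics.pvariance(genetic), 3))
```

Raising `polygenicity` spreads the same genetic variance over more variants with smaller individual effects, while `h2` controls how much of the trait is genetic at all. Varying the two independently gives a range of genetic architectures in the sense the abstract describes.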

AVAILABILITY AND IMPLEMENTATION

A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
