HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes.

Affiliations

Department of Computer Science, Aalto University, Espoo 02150, Finland.

Institute for Molecular Medicine Finland, FIMM, HiLIFE, University of Helsinki, Helsinki 00014, Finland.

Publication information

Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad535.

DOI:10.1093/bioinformatics/btad535

PMID:37647640

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10493177/

Abstract

MOTIVATION

Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking.

RESULTS

We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through a comparison of seven polygenic risk scoring methods across multiple ancestry groups and different genetic architectures.
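
To make "varying degrees of heritability and polygenicity" concrete, the following is a minimal sketch of a textbook additive trait model, not HAPNEST's own algorithm. The function name `simulate_phenotype`, its parameters, and the toy genotype matrix are all assumptions introduced for illustration; here polygenicity is taken to mean the fraction of variants with a nonzero effect, and heritability the share of phenotypic variance explained by the genetic component.

```python
import random
import statistics


def simulate_phenotype(genotypes, h2, polygenicity, seed=0):
    """Simulate a quantitative trait under a simple additive model.

    Hypothetical helper for illustration only (not part of HAPNEST).
    genotypes: list of per-individual lists of allele counts (0, 1, or 2).
    h2: target heritability, with 0 < h2 < 1.
    polygenicity: fraction of variants given a nonzero effect size.
    """
    rng = random.Random(seed)
    n_variants = len(genotypes[0])
    n_causal = max(1, round(polygenicity * n_variants))
    causal = rng.sample(range(n_variants), n_causal)
    effects = {j: rng.gauss(0.0, 1.0) for j in causal}

    # Genetic value: weighted sum of allele counts at the causal variants.
    genetic = [sum(effects[j] * row[j] for j in causal) for row in genotypes]

    g_mean = statistics.fmean(genetic)
    g_sd = statistics.pstdev(genetic)
    if g_sd == 0.0:
        raise ValueError("causal variants show no variation in this sample")

    # Rescale so the genetic component has sample variance exactly h2.
    genetic = [(g - g_mean) / g_sd * h2 ** 0.5 for g in genetic]

    # Add independent Gaussian noise with variance 1 - h2.
    noise_sd = (1.0 - h2) ** 0.5
    phenotype = [g + rng.gauss(0.0, noise_sd) for g in genetic]
    return genetic, phenotype


# Toy data (assumption): 2000 individuals, 50 variants, random allele counts.
toy_rng = random.Random(1)
toy = [[toy_rng.choice((0, 1, 2)) for _ in range(50)] for _ in range(2000)]
genetic, phenotype = simulate_phenotype(toy, h2=0.5, polygenicity=0.1)

# Realized heritability: genetic variance over total phenotypic variance.
print(statistics.pvariance(genetic) / statistics.pvariance(phenotype))
```

The printed ratio should land near the requested `h2`, with sampling noise that shrinks as the number of individuals grows. Real genotypes have linkage disequilibrium and population structure that this toy matrix ignores.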

AVAILABILITY AND IMPLEMENTATION

A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
