Nazaretyan Lusiné, Rentzsch Philipp, Kircher Martin
Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, 10117, Germany.
Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden.
Genome Med. 2025 Aug 4;17(1):84. doi: 10.1186/s13073-025-01517-6.
BACKGROUND: Machine learning and artificial intelligence are increasingly being applied to identify phenotypically causal genetic variation. These data-driven methods require comprehensive training sets to deliver reliable results. However, large unbiased datasets for variant prioritization and effect prediction are rare, as most of the available databases do not represent a broad ensemble of variant effects and are often biased towards the protein-coding genome, or even towards a few well-studied genes.

METHODS: To overcome these issues, we propose several alternative training sets derived from subsets of human standing variation. Specifically, we use variants identified from whole-genome sequences of 71,156 individuals contained in gnomAD v3.0 and approximate the benign set with frequent standing variation and the deleterious set with rare or singleton variation. We apply the Combined Annotation Dependent Depletion (CADD) framework and train several alternative models using CADD v1.6.

RESULTS: Using the NCBI ClinVar validation set, we demonstrate that the alternative models have state-of-the-art accuracy, globally on par with the deleteriousness scores of CADD v1.6 and v1.7, but also outperforming them in certain genomic regions. Being larger than conventional training datasets, including the evolutionary-derived training dataset of about 30 million variants in CADD, standing variation datasets cover a broader range of genomic regions and rare instances of the applied annotations. For example, they cover more recent evolutionary changes common in gene regulatory regions, which are more challenging to assess with conventional tools.

CONCLUSIONS: Standing variation allows us to directly train state-of-the-art models for genome-wide variant prioritization or to augment evolutionary-derived variants in training. The proposed datasets have several advantages, such as being substantially larger and potentially less biased. Datasets derived from standing variation represent natural allelic changes in the human genome and do not require the extensive simulations and adaptations to annotations that the evolutionary-derived sequence alterations used for CADD training do. We provide datasets as well as trained models to the community for further development and application.
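
The labeling idea in METHODS (frequent standing variation as a proxy for benign, rare or singleton variation as a proxy for deleterious) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Variant` record, the `proxy_label` function, and the `FREQUENT_AF` cutoff are hypothetical names and values chosen for illustration, the abstract does not state which frequency thresholds were used, and for simplicity only singletons (not other rare variants) are placed in the proxy-deleterious class.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff for "frequent" variation; NOT taken from the paper.
FREQUENT_AF = 0.01


@dataclass(frozen=True)
class Variant:
    """Illustrative variant record with allele count (AC) and allele number (AN)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    allele_count: int   # number of observed alternate alleles
    allele_number: int  # number of called alleles at this site


def proxy_label(v: Variant, frequent_af: float = FREQUENT_AF) -> Optional[int]:
    """Return 1 (proxy-deleterious), 0 (proxy-benign), or None (not used).

    Assumption: singletons stand in for the deleterious class and variants at
    or above `frequent_af` stand in for the benign class; everything in
    between is left out of the training set in this sketch.
    """
    if v.allele_number <= 0 or v.allele_count <= 0:
        return None  # variant not observed or site not called
    if v.allele_count == 1:
        return 1     # singleton
    if v.allele_count / v.allele_number >= frequent_af:
        return 0     # frequent standing variation
    return None      # intermediate frequency, excluded here


# Assumption: 71,156 diploid genomes give at most 142,312 alleles
# at a fully called autosomal site.
AN = 2 * 71_156
examples = [
    Variant("chr1", 1000, "A", "G", 1, AN),
    Variant("chr1", 2000, "C", "T", 50_000, AN),
    Variant("chr1", 3000, "G", "A", 5, AN),
]
labels = [proxy_label(v) for v in examples]
```

In this sketch the three example variants fall into the proxy-deleterious class, the proxy-benign class, and the excluded intermediate range, respectively. Any real application would need to choose and justify its own frequency cutoffs.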