varCADD: large sets of standing genetic variation enable genome-wide pathogenicity prediction.

Authors

Nazaretyan Lusiné, Rentzsch Philipp, Kircher Martin

Affiliations

Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, 10117, Germany.

Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden.

Publication

Genome Med. 2025 Aug 4;17(1):84. doi: 10.1186/s13073-025-01517-6.


DOI:10.1186/s13073-025-01517-6
PMID:40759979
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12323237/
Abstract

BACKGROUND: Machine learning and artificial intelligence are increasingly being applied to identify phenotypically causal genetic variation. These data-driven methods require comprehensive training sets to deliver reliable results. However, large unbiased datasets for variant prioritization and effect predictions are rare as most of the available databases do not represent a broad ensemble of variant effects and are often biased towards the protein-coding genome, or even towards few well-studied genes.

METHODS: To overcome these issues, we propose several alternative training sets derived from subsets of human standing variation. Specifically, we use variants identified from whole-genome sequences of 71,156 individuals contained in gnomAD v3.0 and approximate the benign set with frequent standing variation and the deleterious set with rare or singleton variation. We apply the Combined Annotation Dependent Depletion framework (CADD) and train several alternative models using CADD v1.6.

RESULTS: Using the NCBI ClinVar validation set, we demonstrate that the alternative models have state-of-the-art accuracy, globally on par with deleteriousness scores of CADD v1.6 and v1.7, but also outperforming them in certain genomic regions. Being larger than conventional training datasets, including the evolutionary-derived training dataset of about 30 million variants in CADD, standing variation datasets cover a broader range of genomic regions and rare instances of the applied annotations. For example, they cover more recent evolutionary changes common in gene regulatory regions, which are more challenging to assess with conventional tools.

CONCLUSIONS: Standing variation allows us to directly train state-of-the-art models for genome-wide variant prioritization or to augment evolutionary-derived variants in training. The proposed datasets have several advantages, like being substantially larger and potentially less biased. Datasets derived from standing variation represent natural allelic changes in the human genome and do not require extensive simulations and adaptations to annotations of evolutionary-derived sequence alterations used for CADD training. We provide datasets as well as trained models to the community for further development and application.
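
The labeling idea in METHODS (frequent standing variation as a proxy for benign, rare or singleton variation as a proxy for deleterious) can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The `Variant` class, the `proxy_label` function, and the 1% frequency cutoff are hypothetical choices made for illustration only; the abstract does not state the thresholds actually used. The allele number of 142,312 in the usage lines simply assumes two alleles for each of the 71,156 individuals.

```python
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: illustrative cutoff only; the abstract gives no thresholds.
COMMON_AF_THRESHOLD = 0.01


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record with population allele counts."""
    chrom: str
    pos: int
    ref: str
    alt: str
    allele_count: int   # number of observed alternate alleles (AC)
    allele_number: int  # number of genotyped alleles at the site (AN)

    @property
    def allele_frequency(self) -> float:
        # Guard against sites with no genotyped alleles.
        if self.allele_number == 0:
            return 0.0
        return self.allele_count / self.allele_number


def proxy_label(variant: Variant,
                common_af: float = COMMON_AF_THRESHOLD) -> Optional[int]:
    """Return 1 (proxy-deleterious), 0 (proxy-benign), or None (unused).

    Singletons (seen exactly once) stand in for the deleterious class and
    frequent variants for the benign class; everything in between is left
    out of the training set in this sketch.
    """
    if variant.allele_count == 1:
        return 1
    if variant.allele_frequency >= common_af:
        return 0
    return None


# Usage: 71,156 diploid genomes give at most 142,312 alleles per autosomal site.
singleton = Variant("chr1", 1000, "A", "G", 1, 142312)
frequent = Variant("chr1", 2000, "C", "T", 50000, 142312)
intermediate = Variant("chr1", 3000, "G", "A", 20, 142312)
labels = [proxy_label(v) for v in (singleton, frequent, intermediate)]
```

Variants that are neither singletons nor frequent are dropped here to keep the two proxy classes well separated; a different design could instead assign every rare variant to the deleterious class, which corresponds to the "rare" alternative mentioned in the abstract.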

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/82ce/12323237/5ca80c8c4280/13073_2025_1517_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/82ce/12323237/aae86ae52e5a/13073_2025_1517_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/82ce/12323237/19e9bb533f4d/13073_2025_1517_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/82ce/12323237/83b5e821ab97/13073_2025_1517_Fig4_HTML.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/82ce/12323237/4379455358d6/13073_2025_1517_Fig5_HTML.jpg
Figure 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/82ce/12323237/0307a7e4bfbc/13073_2025_1517_Fig6_HTML.jpg
