CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions.
Affiliations
Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.
Institute of Human Genetics, University Hospital Schleswig-Holstein, University of Lübeck, Lübeck, Germany.
Publication information
Nucleic Acids Res. 2024 Jan 5;52(D1):D1143-D1154. doi: 10.1093/nar/gkad989.
Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.
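As an illustration of how a variant score might be compared against Deep Mutational Scanning measurements, the following is a minimal sketch that computes a Spearman rank correlation in plain Python. The abstract does not state which metric was used, so the choice of rank correlation is an assumption, and the score and fitness values below are made-up toy numbers, not CADD output or ProteinGym data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: higher score = predicted more deleterious,
# lower DMS fitness = experimentally more damaging.
toy_scores = [25.1, 3.2, 17.8, 9.4, 30.0]
toy_fitness = [0.10, 0.95, 0.40, 0.80, 0.05]

rho = spearman(toy_scores, toy_fitness)
print(f"Spearman rho = {rho:.3f}")
```

In this toy setup a strongly negative correlation indicates agreement, because high predicted deleteriousness lines up with low measured fitness; the sign convention depends on how each assay reports its measurements.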