Berlin Institute of Health (BIH) at Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany.
Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, 23562 Lübeck, Germany.
Genome Res. 2022 Apr;32(4):766-777. doi: 10.1101/gr.275995.121. Epub 2022 Feb 23.
Although technological advances have improved the identification of structural variants (SVs) in the human genome, their interpretation remains challenging. Several methods rely on individual mechanistic principles, such as the deletion of coding sequence or the disruption of 3D genome architecture. However, a comprehensive tool that uses the broad spectrum of available annotations is missing. Here, we describe CADD-SV, a method to retrieve and integrate a wide set of annotations to predict the effects of SVs. Previously, supervised learning approaches were limited by the small and biased set of annotated pathogenic or benign SVs. We overcome this problem by using a surrogate training objective, the Combined Annotation Dependent Depletion (CADD) of functional variants. We use human- and chimpanzee-derived SVs as proxy-neutral variants and contrast them with matched simulated variants as proxy-deleterious variants, an approach that has proven powerful for short sequence variants. Our tool computes summary statistics over diverse variant annotations and uses random forest models to prioritize deleterious structural variants. The resulting CADD-SV scores correlate with known pathogenic and rare population variants. We further show that we can prioritize somatic cancer variants as well as noncoding variants known to affect gene expression. We provide a website and offline-scoring tool for easy application of CADD-SV.
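The training setup described in the abstract can be illustrated with a toy sketch. It is not the CADD-SV pipeline: the function names, the dictionary layout of a variant, the single per-base annotation track, and the matching rule (same type and length, random position) are all assumptions made for illustration, and the random forest step is left out. Each observed SV is labelled proxy-neutral (0), one simulated counterpart is labelled proxy-deleterious (1), and each variant is reduced to summary statistics over the annotation.

```python
import random
from statistics import mean


def simulate_matched(sv, chrom_length, rng):
    """Hypothetical matching rule: same type and length, random position."""
    length = sv["end"] - sv["start"]
    start = rng.randrange(0, chrom_length - length + 1)
    return {"type": sv["type"], "start": start, "end": start + length}


def summarize(sv, track):
    """Max and mean of a per-base annotation track over [start, end).

    Assumes the variant spans at least one base.
    """
    values = track[sv["start"]:sv["end"]]
    return {"max": max(values), "mean": mean(values)}


def build_training_set(observed_svs, track, seed=0):
    """Pair each observed SV (label 0) with one simulated SV (label 1)."""
    rng = random.Random(seed)
    rows = []
    for sv in observed_svs:
        rows.append({"sv": sv, "features": summarize(sv, track), "label": 0})
        sim = simulate_matched(sv, len(track), rng)
        rows.append({"sv": sim, "features": summarize(sim, track), "label": 1})
    return rows


# Toy annotation track (e.g. a conservation-like score per base); values are made up.
track = [0.1] * 50 + [0.9] * 50

# Toy observed SVs, standing in for human- or chimpanzee-derived variants.
observed = [
    {"type": "DEL", "start": 10, "end": 30},
    {"type": "DEL", "start": 5, "end": 45},
]

rows = build_training_set(observed, track)
for row in rows:
    print(row["label"], row["sv"], row["features"])
```

The resulting feature rows and labels could then be passed to any random forest implementation. With real data, a model trained this way learns which annotation patterns are depleted among variants that survived selection, which is the surrogate objective the abstract refers to.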