Li Weijiang, Li Xiaomin, Lavallee Ethan, Saparov Alice, Zitnik Marinka, Cassa Christopher
Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, 02115, MA, United States.
School of Engineering and Applied Sciences, Harvard University, Boston, 02138, MA, United States.
medRxiv. 2024 Dec 31:2024.12.31.24319792. doi: 10.1101/2024.12.31.24319792.
Despite rapid advances in genomic sequencing, most rare genetic variants remain insufficiently characterized for clinical use, limiting the potential of personalized medicine. When classifying whether a variant is pathogenic, clinical labs adhere to diagnostic guidelines that comprehensively evaluate many forms of evidence, including case data, computational predictions, and functional screening. While a substantial amount of clinical evidence has been developed for these variants, the majority cannot be definitively classified as 'pathogenic' or 'benign', and thus persist as 'Variants of Uncertain Significance' (VUS). We processed over 2.4 million plaintext variant summaries from ClinVar, using sentence-level classification to remove content that does not contain evidence and discarding uninformative summaries. We developed ClinVar-BERT to discern clinical evidence within these summaries by fine-tuning a BioBERT-based model with labeled records. When we validated classifications from this model against orthogonal functional screening data, ClinVar-BERT significantly separated estimates of functional impact in clinically actionable genes, including (p = × ), (p = × ), and (p = × ). Additionally, ClinVar-BERT achieved an AUROC of 0.927 in classifying ClinVar VUS against this functional screening data. This suggests that ClinVar-BERT is capable of discerning evidence from diagnostic reports and can be used to prioritize variants for re-assessment by diagnostic labs and expert curation panels.
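For readers unfamiliar with the AUROC reported above, the following is a minimal sketch of how such a value can be computed from per-variant model scores and binary labels derived from a functional screen. It is not the authors' evaluation code: the function name `auroc`, the score and label values, and the convention that label 1 means "functionally abnormal" are all assumptions made for illustration.

```python
def auroc(scores, labels):
    """Rank-based AUROC (equivalent to the Mann-Whitney U statistic).

    scores: model outputs, higher meaning "more likely pathogenic" (assumed).
    labels: 1 for functionally abnormal, 0 for functionally normal (assumed).
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    ranks = [0.0] * n

    # Assign 1-based ranks, averaging over runs of tied scores.
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative label")

    # U statistic for the positive class, normalized to the [0, 1] range.
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Hypothetical toy data: four variants with made-up scores and labels.
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_labels = [0, 0, 1, 1]
toy_auroc = auroc(toy_scores, toy_labels)
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a value of 0.927 means that a randomly chosen functionally abnormal variant is scored above a randomly chosen functionally normal one about 93% of the time.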