Performance evaluation of computational methods for splice-disrupting variants and improving the performance using the machine learning-based framework.

Affiliation

Division of Cardiology, Department of Internal Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology and Hubei Key Laboratory of Genetics and Molecular Mechanisms of Cardiological Disorders, Wuhan 430030, China.

Publication information

Brief Bioinform. 2022 Sep 20;23(5). doi: 10.1093/bib/bbac334.

DOI:10.1093/bib/bbac334

PMID:35976049

Abstract

A critical challenge in genetic diagnostics is assessing disease-associated genetic variants, particularly those that fall outside canonical splice sites and act by altering alternative splicing. Several computational methods have been developed to prioritize variants by their effect on splicing; however, evaluating these methods is hampered by the lack of large-scale benchmark datasets. In this study, we used a splicing-region-specific strategy to evaluate prediction methods on eight independent datasets. Under most conditions, dbscSNV-ADA performed better in the exonic region, S-CAP performed better in the core donor and acceptor regions, S-CAP and SpliceAI performed better in the extended acceptor region, and MMSplice performed better at identifying variants that cause exon skipping. However, performance varied widely across datasets and splicing regions, and no single method performed best on all datasets. To address this, we developed a new method, machine learning-based classification of splice sites variants (MLCsplice), which predicts the effect of variants on splicing based on the individual methods. MLCsplice achieved stable and superior prediction performance compared with any individual method. To facilitate identification of the splicing effects of variants, we provide precomputed MLCsplice scores for all possible splice-site variants across human protein-coding genes (http://39.105.51.3:8090/MLCsplice/). We believe that the performance of the individual methods on the eight benchmark datasets will provide tentative guidance for selecting an appropriate method to prioritize candidate splice-disrupting variants, thereby increasing genetic diagnostic yield.
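The splicing-region-specific evaluation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name the evaluation metric (AUC is used as a stand-in), and the record layout, the region labels, the method names `method_a` and `method_b`, and all scores are made up. This is not the authors' pipeline.

```python
from collections import defaultdict


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined unless both classes are present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_region(variants, methods):
    """Group variants by splicing region, then score each method per region."""
    grouped = defaultdict(list)
    for v in variants:
        grouped[v["region"]].append(v)
    result = {}
    for region, rows in grouped.items():
        labels = [r["label"] for r in rows]
        for m in methods:
            result[(region, m)] = auc(labels, [r[m] for r in rows])
    return result


# Hypothetical toy records: label 1 = splice-disrupting, 0 = neutral.
variants = [
    {"region": "exonic", "label": 1, "method_a": 0.9, "method_b": 0.4},
    {"region": "exonic", "label": 0, "method_a": 0.2, "method_b": 0.6},
    {"region": "core_donor", "label": 1, "method_a": 0.5, "method_b": 0.8},
    {"region": "core_donor", "label": 0, "method_a": 0.5, "method_b": 0.1},
]

table = auc_by_region(variants, ["method_a", "method_b"])
for key in sorted(table):
    print(key, table[key])
```

Even in this toy data the ranking of the two hypothetical methods flips between regions, which is the kind of region dependence that a single pooled score would hide.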
