Ding Maolin, Chen Ken, Yang Yuedong, Zhao Huiying
School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, 510000, China.
Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-Sen University), Ministry of Education, Guangzhou, China.
Hum Genet. 2025 Mar;144(2-3):253-263. doi: 10.1007/s00439-024-02667-0. Epub 2024 Apr 4.
Genetic diseases are mostly associated with genetic variants, including missense, synonymous, nonsense, and copy number variants. Previous studies indicate that these different kinds of variants affect phenotypes in various ways. Understanding the functional consequences of these variants, especially the noncoding ones, remains essential but challenging because corresponding annotations are lacking. Many computational methods have been proposed to identify risk variants, but most of them use only DNA-level and protein-level annotations to predict pathogenicity, and others are restricted to missense variants. In this study, we curated DNA-, RNA-, and protein-level features to discriminate disease-causing variants in both coding and noncoding regions. Features of protein sequence and protein structure proved essential for analyzing missense variants in coding regions, while features related to RNA splicing and RNA-binding protein (RBP) binding were significant for variants in noncoding regions and synonymous variants in coding regions. By integrating these features, we developed the Multi-level feature Genomic Variants Predictor (ML-GVP) using gradient boosting trees. The method was trained on more than 400,000 variants in the Sherloc training set from the 6th Critical Assessment of Genome Interpretation (CAGI6) and showed superior performance. It was one of the two best-performing predictors on the blind test in the Sherloc assessment, and its performance was further confirmed on an independent test dataset of de novo variants.
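As an illustration of the general idea of combining DNA-, RNA-, and protein-level features in a gradient boosting tree classifier, the following is a minimal sketch and not the ML-GVP implementation. The three feature names, the synthetic data, the labeling rule, and all hyperparameters are assumptions made purely for demonstration; the boosted trees here are depth-1 stumps fitted to the negative gradient of the log loss.

```python
import math
import random

# Hypothetical feature layout (not from the paper): one score per annotation level.
FEATURES = ["dna_conservation", "rna_splicing_score", "protein_structure_score"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) for the
    depth-1 regression tree with the lowest squared error on the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]

def train_gbt(X, y, n_rounds=25, learning_rate=1.0):
    """Gradient boosting for binary labels: each round fits a stump to
    y - p, the negative gradient of the log loss with respect to the score."""
    p0 = sum(y) / len(y)
    base = math.log(p0 / (1.0 - p0))  # initial score = log-odds of the positive class
    scores = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lv, rv = fit_stump(X, residuals)
        stumps.append((j, t, lv, rv))
        scores = [s + learning_rate * (lv if row[j] <= t else rv)
                  for s, row in zip(scores, X)]
    return base, learning_rate, stumps

def predict_proba(model, row):
    """Predicted probability that a variant is pathogenic."""
    base, learning_rate, stumps = model
    score = base + sum(learning_rate * (lv if row[j] <= t else rv)
                       for j, t, lv, rv in stumps)
    return sigmoid(score)

# Synthetic stand-in data: 100 "variants" with three features in [0, 1].
# The labeling rule below is invented for this example only.
rng = random.Random(0)
X = [[rng.random() for _ in FEATURES] for _ in range(100)]
y = [1 if 0.5 * a + 0.3 * b + 0.2 * c + rng.gauss(0, 0.05) > 0.5 else 0
     for a, b, c in X]

model = train_gbt(X, y)
predictions = [1 if predict_proba(model, row) >= 0.5 else 0 for row in X]
train_accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(f"training accuracy on synthetic data: {train_accuracy:.2f}")
```

In practice a library implementation of gradient boosting and a held-out test set would replace the hand-written stumps and the training-set accuracy shown here.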