Prioritizing genomic variants pathogenicity via DNA, RNA, and protein-level features based on extreme gradient boosting.

Author information

Ding Maolin, Chen Ken, Yang Yuedong, Zhao Huiying

Affiliations

School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, 510000, China.

Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-Sen University), Ministry of Education, Guangzhou, China.

Publication information

Hum Genet. 2025 Mar;144(2-3):253-263. doi: 10.1007/s00439-024-02667-0. Epub 2024 Apr 4.

Abstract

Genetic diseases are mostly associated with genetic variants, including missense, synonymous, nonsense, and copy number variants. Previous studies indicate that these different kinds of variants affect phenotypes in various ways. Understanding the functional consequences of these variants, especially the noncoding ones, remains essential but challenging because corresponding annotations are lacking. Many computational methods have been proposed to identify risk variants, but most of them use only DNA-level and protein-level annotations to predict pathogenicity, and others are restricted to missense variants. In this study, we curated DNA-, RNA-, and protein-level features to discriminate disease-causing variants in both coding and noncoding regions. Features of protein sequences and protein structures proved essential for analyzing missense variants in coding regions, while features related to RNA splicing and RNA-binding protein (RBP) binding are significant for variants in noncoding regions and for synonymous variants in coding regions. By integrating these features, we formulated the Multi-level feature Genomic Variants Predictor (ML-GVP) using gradient boosting trees. The method was trained on more than 400,000 variants in the Sherloc training set from the 6th Critical Assessment of Genome Interpretation and achieved superior performance. It was one of the two best-performing predictors on the blind test in the Sherloc assessment, and its performance was further confirmed on an independent test dataset of de novo variants.
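The core idea of the abstract, concatenating DNA-, RNA-, and protein-level features per variant and feeding them to a gradient-boosted tree ensemble, can be illustrated with a minimal sketch. This is not the ML-GVP implementation: it uses plain gradient boosting with depth-1 trees (stumps) written in standard-library Python instead of XGBoost, and every feature, the synthetic data generator, and all function names are hypothetical stand-ins chosen for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_variant(rng):
    """Synthetic variant with 2 DNA-, 2 RNA-, and 2 protein-level scores.
    All features are hypothetical; one score per level carries signal."""
    label = 1 if rng.random() < 0.5 else 0  # 1 = pathogenic, 0 = benign
    shift = 1.5 * label
    dna = [rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)]
    rna = [rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)]
    protein = [rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)]
    return dna + rna + protein, label  # multi-level features, concatenated

def fit_stump(X, residuals):
    """Depth-1 regression tree: best single-feature split by squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for t in values[:-1]:  # splitting at a data value keeps both sides non-empty
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    return best[1:]  # (feature index, threshold, left value, right value)

def stump_value(stump, x):
    j, t, lv, rv = stump
    return lv if x[j] <= t else rv

def fit_boosting(X, y, rounds=25, lr=0.5):
    """Gradient boosting for log loss: each stump fits the residuals y - p."""
    p = sum(y) / len(y)
    f0 = math.log(p / (1.0 - p))  # initial score: log-odds of the base rate
    scores = [f0] * len(X)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + lr * stump_value(stump, x) for s, x in zip(scores, X)]
    return f0, stumps, lr

def predict_proba(model, x):
    f0, stumps, lr = model
    return sigmoid(f0 + lr * sum(stump_value(s, x) for s in stumps))

def accuracy(model, data):
    hits = sum((predict_proba(model, x) >= 0.5) == (label == 1) for x, label in data)
    return hits / len(data)

rng = random.Random(0)
data = [make_variant(rng) for _ in range(180)]
train, test = data[:100], data[100:]
model = fit_boosting([x for x, _ in train], [label for _, label in train])
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)
print(f"train accuracy={train_acc:.2f}, held-out accuracy={test_acc:.2f}")
```

In practice a library such as XGBoost would replace the hand-written booster, and the feature columns would come from real annotation sources rather than a random generator. The sketch only shows how features from several biological levels can sit side by side in one input vector for a boosted-tree classifier.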
