Prediction of human pathogenic start loss variants based on self-supervised contrastive learning.

Author information

Liu Jie, Fan Henghui, Cheng Na, Su Yansen, Xia Junfeng

Affiliations

Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, Anhui, China.

School of Biomedical Engineering, Anhui Medical University, Hefei, 230032, Anhui, China.

Publication information

BMC Biol. 2025 Aug 8;23(1):250. doi: 10.1186/s12915-025-02348-y.

DOI:10.1186/s12915-025-02348-y

PMID:40781627

Abstract

BACKGROUND

Start loss variants are a class of genetic variants that affect the bases of the start codon, disrupting the normal translation initiation process and leading to protein deletions or the production of different proteins. Accurate assessment of the pathogenicity of these variants is crucial for deciphering disease mechanisms and integrating genomics into clinical practice. However, among the tens of thousands of start loss variants in the human genome, only about 1% have been classified as pathogenic or benign. Computational methods that rely solely on small amounts of labeled data often lack sufficient generalization capabilities, restricting their effectiveness in predicting the impact of start loss variants.
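As a rough illustration of the definition above, the following minimal sketch checks whether a single-base substitution destroys the start codon of a coding sequence. The function name `is_start_loss`, the 0-based position convention, and the restriction to canonical ATG starts and single-nucleotide substitutions are all assumptions made for this example; they are not taken from the paper, and insertions or deletions are not handled.

```python
START_CODON = "ATG"  # canonical start codon; non-canonical starts are ignored here


def is_start_loss(cds: str, pos: int, alt: str) -> bool:
    """Return True if substituting `alt` at 0-based CDS position `pos`
    changes the ATG start codon (hypothetical helper, SNVs only)."""
    cds = cds.upper()
    alt = alt.upper()
    if not cds.startswith(START_CODON):
        raise ValueError("coding sequence does not begin with ATG")
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError("alt must be a single base (A, C, G or T)")
    if pos < 0 or pos >= len(cds):
        raise ValueError("position is outside the coding sequence")
    if pos > 2:
        # The substitution lies downstream of the start codon.
        return False
    mutated = cds[:pos] + alt + cds[pos + 1:]
    return mutated[:3] != START_CODON


if __name__ == "__main__":
    print(is_start_loss("ATGGCCAAA", 0, "G"))  # substitution inside the start codon
    print(is_start_loss("ATGGCCAAA", 4, "T"))  # substitution downstream of it
```

A real pipeline would work from genomic coordinates and transcript annotations rather than a bare coding sequence; the sketch only makes the "affects the bases of the start codon" criterion concrete.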

RESULTS

Here, we introduce StartCLR, a novel prediction method specifically designed for identifying pathogenic start loss variants. StartCLR captures variant context information from different dimensions by integrating embedding features from diverse DNA language models. Moreover, it employs self-supervised pre-training combined with supervised fine-tuning, enabling the effective utilization of both a large amount of unlabeled data and a small amount of labeled data to enhance prediction accuracy. Our experimental results show that StartCLR exhibits strong generalization and superior prediction performance across different test sets. Notably, when trained exclusively on high-confidence labeled data, StartCLR retains or even improves the prediction accuracy despite the reduced amount of labeled data.
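The abstract does not specify the loss function or how embeddings are combined, so the following is only a sketch of one common way to set up self-supervised contrastive pre-training: embeddings from several models are concatenated, and an NT-Xent (SimCLR-style) loss pulls two views of the same variant together while pushing other variants apart. The names `fuse_embeddings` and `nt_xent_loss`, the concatenation step, and the temperature of 0.5 are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def fuse_embeddings(*per_model_embeddings):
    """Concatenate embeddings of one variant from several models
    (assumed fusion strategy; the paper may combine them differently)."""
    fused = []
    for emb in per_model_embeddings:
        fused.extend(emb)
    return fused


def _l2_normalize(vec):
    # Assumes a non-zero vector.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss over N paired views: view_a[i] and view_b[i] are two
    views of the same variant; every other sample in the batch is a negative."""
    n = len(view_a)
    z = [_l2_normalize(v) for v in list(view_a) + list(view_b)]
    total = 0.0
    for i in range(2 * n):
        j_pos = (i + n) % (2 * n)  # index of the positive partner of sample i
        logits = [_dot(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = _dot(z[i], z[j_pos]) / temperature
        # Numerically stable log-sum-exp over all non-self similarities.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "fused" embeddings for 8 unlabeled variants (random stand-ins).
    base = [fuse_embeddings([rng.gauss(0, 1) for _ in range(8)],
                            [rng.gauss(0, 1) for _ in range(8)])
            for _ in range(8)]
    # A second view of each variant: the same embedding with small noise.
    noisy = [[x + 0.01 * rng.gauss(0, 1) for x in v] for v in base]
    print(nt_xent_loss(base, noisy))
```

In a full pre-train-then-fine-tune setup, an encoder trained with a loss of this kind on unlabeled variants would then be fine-tuned with a supervised classification head on the small labeled set; that second stage is omitted here.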

CONCLUSIONS

Collectively, these findings highlight the potential of integrating self-supervised contrastive learning with unlabeled data to mitigate the challenge posed by the scarcity of labeled start loss variants.
