Prediction of human pathogenic start loss variants based on self-supervised contrastive learning.

Author information

Liu Jie, Fan Henghui, Cheng Na, Su Yansen, Xia Junfeng

Affiliations

Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, Anhui, China.

School of Biomedical Engineering, Anhui Medical University, Hefei, 230032, Anhui, China.

Publication information

BMC Biol. 2025 Aug 8;23(1):250. doi: 10.1186/s12915-025-02348-y.

Abstract

BACKGROUND

Start loss variants are a class of genetic variants that affect the bases of the start codon, disrupting the normal translation initiation process and leading to protein deletions or the production of different proteins. Accurate assessment of the pathogenicity of these variants is crucial for deciphering disease mechanisms and integrating genomics into clinical practice. However, among the tens of thousands of start loss variants in the human genome, only about 1% have been classified as pathogenic or benign. Computational methods that rely solely on small amounts of labeled data often lack sufficient generalization capabilities, restricting their effectiveness in predicting the impact of start loss variants.
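As a concrete illustration of the definition above, the following minimal sketch checks whether a single-base substitution destroys the start codon. It is a simplification and not part of StartCLR: the function name `is_start_loss`, the 0-based position convention, and the restriction to the canonical ATG codon and to single-nucleotide substitutions are all assumptions made for this example.

```python
def is_start_loss(cds: str, pos: int, alt: str) -> bool:
    """Return True if substituting `alt` at 0-based position `pos`
    of the coding sequence `cds` destroys the ATG start codon.

    Assumptions (illustrative only): the coding sequence begins with the
    canonical ATG codon, and the variant is a single-base substitution.
    """
    cds = cds.upper()
    alt = alt.upper()
    if not cds.startswith("ATG"):
        raise ValueError("coding sequence must begin with ATG")
    if not 0 <= pos < len(cds):
        raise ValueError("position is outside the coding sequence")
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError("alt must be a single base: A, C, G, or T")

    # Apply the substitution and compare the first codon with ATG.
    mutated = cds[:pos] + alt + cds[pos + 1:]
    return mutated[:3] != "ATG"


if __name__ == "__main__":
    print(is_start_loss("ATGGCCAAA", 0, "G"))  # A>G in the start codon
    print(is_start_loss("ATGGCCAAA", 4, "T"))  # change outside the start codon
```

A real annotation pipeline would work from genomic coordinates and transcript models rather than a bare coding sequence; this sketch only makes the "affects the bases of the start codon" criterion explicit.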

RESULTS

Here, we introduce StartCLR, a novel prediction method specifically designed for identifying pathogenic start loss variants. StartCLR captures variant context information from different dimensions by integrating embedding features from diverse DNA language models. Moreover, it employs self-supervised pre-training combined with supervised fine-tuning, enabling the effective utilization of both a large amount of unlabeled data and a small amount of labeled data to enhance prediction accuracy. Our experimental results show that StartCLR exhibits strong generalization and superior prediction performance across different test sets. Notably, when trained exclusively on high-confidence labeled data, StartCLR retains or even improves the prediction accuracy despite the reduced amount of labeled data.
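The abstract does not state which contrastive objective StartCLR uses during self-supervised pre-training. As a hedged sketch of the general idea, the code below implements the NT-Xent loss popularized by SimCLR, assuming that each unlabeled variant yields two embedding "views" (`z1` and `z2`) that should be pulled together while all other embeddings in the batch are pushed apart. The function name, the temperature value, and the use of NumPy are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def nt_xent_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.5) -> float:
    """NT-Xent contrastive loss over a batch of paired embeddings.

    z1, z2: arrays of shape (n, d); row i of z1 and row i of z2 are two
    views of the same sample (a positive pair). Hypothetical setup, not
    the authors' implementation.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0).astype(float)
    # L2-normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    # Exclude self-similarity from the softmax denominator.
    np.fill_diagonal(sim, -np.inf)

    # Index of the positive partner for each of the 2n rows.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])

    # Numerically stable log-sum-exp over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    loss = log_denom - sim[np.arange(2 * n), pos]
    return float(loss.mean())


if __name__ == "__main__":
    views = np.eye(4)
    aligned = nt_xent_loss(views, views)
    shuffled = nt_xent_loss(views, np.roll(views, 1, axis=0))
    print(aligned < shuffled)  # matched pairs give the lower loss
```

In a pipeline of the kind the abstract describes, an encoder pre-trained with such an objective on unlabeled variants would then be fine-tuned with a supervised classification loss on the small labeled set.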

CONCLUSIONS

Collectively, these findings highlight the potential of integrating self-supervised contrastive learning with unlabeled data to mitigate the challenge posed by the scarcity of labeled start loss variants.
