Liu Jie, Fan Henghui, Cheng Na, Su Yansen, Xia Junfeng
Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, Anhui, China.
School of Biomedical Engineering, Anhui Medical University, Hefei, 230032, Anhui, China.
BMC Biol. 2025 Aug 8;23(1):250. doi: 10.1186/s12915-025-02348-y.
BACKGROUND: Start loss variants are a class of genetic variants that affect the bases of the start codon, disrupting the normal translation initiation process and leading to protein deletions or the production of different proteins. Accurate assessment of the pathogenicity of these variants is crucial for deciphering disease mechanisms and integrating genomics into clinical practice. However, among the tens of thousands of start loss variants in the human genome, only about 1% have been classified as pathogenic or benign. Computational methods that rely solely on small amounts of labeled data often lack sufficient generalization capabilities, restricting their effectiveness in predicting the impact of start loss variants.
RESULTS: Here, we introduce StartCLR, a novel prediction method specifically designed for identifying pathogenic start loss variants. StartCLR captures variant context information from different dimensions by integrating embedding features from diverse DNA language models. Moreover, it employs self-supervised pre-training combined with supervised fine-tuning, enabling the effective utilization of both a large amount of unlabeled data and a small amount of labeled data to enhance prediction accuracy. Our experimental results show that StartCLR exhibits strong generalization and superior prediction performance across different test sets. Notably, when trained exclusively on high-confidence labeled data, StartCLR retains or even improves the prediction accuracy despite the reduced amount of labeled data.
CONCLUSIONS: Collectively, these findings highlight the potential of integrating self-supervised contrastive learning with unlabeled data to mitigate the challenge posed by the scarcity of labeled start loss variants.
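The abstract names self-supervised contrastive pre-training but does not specify the loss or how paired views are built. As a minimal sketch only, the NT-Xent loss (the SimCLR-style objective, used here as a stand-in and not confirmed to be the one in StartCLR) can be written as follows. The function name `nt_xent_loss`, the temperature value, and the idea of forming two views by adding small noise to a variant's concatenated embedding are all assumptions made for illustration.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss for two views of the same n items.

    z1, z2: arrays of shape (n, d); row i of z1 and row i of z2 are
    assumed to be two views of the same (hypothetical) variant embedding.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)               # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit length -> cosine similarity
    sim = (z @ z.T) / temperature                      # (2n, 2n) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # an item is never its own negative

    # Index of the positive partner for every row: i <-> i + n.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-sum-exp over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))

    # Cross-entropy of picking the positive partner among all other items.
    loss = log_denominator - sim[np.arange(2 * n), pos]
    return loss.mean()

# Toy usage with random stand-ins for unlabeled variant embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 16))
view_a = embeddings + 0.01 * rng.normal(size=embeddings.shape)
view_b = embeddings + 0.01 * rng.normal(size=embeddings.shape)
print(nt_xent_loss(view_a, view_b))
```

In a two-stage setup of the kind the abstract describes, an encoder would first be trained to minimize a loss like this on unlabeled variants, and a classifier would then be fine-tuned on the small labeled set; the encoder and classifier themselves are omitted here.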