Zhang Xuchao, Chen Jing, Wang Yongtian, Wang Xiaofeng, Hu Jialu, Peng Jiajie, Shang Xuequn, Wang Yanpu, Wang Tao
School of Computer Science, Northwestern Polytechnical University, 1 Dongxiang Rd., Xi'an 710072, Shaanxi, China.
Key Laboratory of Big Data Storage and Management, Ministry of Industry and Information Technology, Northwestern Polytechnical University, 1 Dongxiang Rd., Xi'an 710072, Shaanxi, China.
Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf303.
Cancer remains a significant global health burden, underscoring the need for innovative diagnostic tools to enable early detection and improve patient outcomes. While circulating cell-free DNA (cfDNA) methylation has emerged as a promising biomarker for noninvasive cancer diagnostics, existing methods are often limited by the high dimensionality of methylation data, small sample sizes, and a lack of biological interpretability. To address these challenges, we propose cfMethylPre, a novel deep transfer learning framework tailored for cancer detection using cfDNA methylation data. cfMethylPre takes embeddings of DNA sequence information from a pretrained large language model and integrates them with methylation profiles to enhance feature representation. The deep transfer learning process involves pretraining on bulk DNA methylation data encompassing 2801 samples across 82 cancer types and normal controls, followed by fine-tuning on cfDNA methylation data. This approach allows the model to adapt to the unique characteristics of cfDNA while improving predictive accuracy. Our model achieved higher predictive accuracy than state-of-the-art methods, with a weighted Matthews correlation coefficient of 0.926 and a weighted F1-score of 0.942. Through model interpretation and biological experimental validation, we identified three novel breast cancer genes (PCDHA10, PRICKLE2, and PRTG) and demonstrated their inhibitory effects on cell proliferation and migration in breast cancer cell lines. These findings establish cfMethylPre as a powerful and interpretable tool for cancer diagnostics and biological discovery, paving the way for its application in precision oncology.
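The abstract reports a weighted Matthews correlation coefficient and a weighted F1-score but does not say how the weighting is done. As a minimal sketch, the code below assumes one common convention: each class is scored one-vs-rest and the per-class scores are averaged with weights proportional to class support. The function names, the class labels, and the toy predictions are hypothetical and are not taken from the paper; the authors' actual definition may differ.

```python
from collections import Counter
from math import sqrt


def one_vs_rest_counts(y_true, y_pred, label):
    """Return (tp, fp, fn, tn), treating `label` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == label and p == label:
            tp += 1
        elif t != label and p == label:
            fp += 1
        elif t == label and p != label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_from_counts(tp, fp, fn, tn):
    """Binary F1; defined as 0.0 when the class is never true or predicted."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc_from_counts(tp, fp, fn, tn):
    """Binary Matthews correlation coefficient; 0.0 when undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def support_weighted(y_true, y_pred, score_fn):
    """Average one-vs-rest scores, weighting each class by its support.

    ASSUMPTION: this weighting scheme is illustrative only; the abstract
    does not specify how its weighted metrics are computed.
    """
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        (n / total) * score_fn(*one_vs_rest_counts(y_true, y_pred, label))
        for label, n in support.items()
    )


# Hypothetical toy labels, not data from the study.
y_true = ["breast", "breast", "lung", "lung", "normal", "normal"]
y_pred = ["breast", "breast", "lung", "normal", "normal", "normal"]

weighted_f1 = support_weighted(y_true, y_pred, f1_from_counts)
weighted_mcc = support_weighted(y_true, y_pred, mcc_from_counts)
print(f"weighted F1:  {weighted_f1:.3f}")
print(f"weighted MCC: {weighted_mcc:.3f}")
```

Weighting by support keeps rare classes from dominating the summary score, which matters when per-cancer-type sample counts are uneven. Note that a support-weighted one-vs-rest MCC is not the same quantity as the multiclass MCC computed from the full confusion matrix.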