araCNA: somatic copy number profiling using long-range sequence models.

Author information

Ellen Visscher, Christopher Yau

Affiliation

Nuffield Department for Women's & Reproductive Health, University of Oxford, Women's Centre, John Radcliffe Hospital, Oxford OX3 9DU, United Kingdom.

Publication information

NAR Genom Bioinform. 2025 Sep 9;7(3):lqaf124. doi: 10.1093/nargab/lqaf124. eCollection 2025 Sep.

DOI:10.1093/nargab/lqaf124

PMID:40933674

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12418177/

Abstract

Somatic copy number alterations (CNAs) are hallmarks of cancer. Current algorithms that call CNAs from whole-genome sequenced (WGS) data have not exploited deep learning methods owing to computational scaling limitations. Here, we present a novel deep-learning approach, araCNA, trained only on simulated data that can accurately predict CNAs in real WGS cancer genomes. araCNA uses novel transformer alternatives (e.g. Mamba) to handle genomic-scale sequence lengths (∼1M) and learn long-range interactions. Results are extremely accurate on simulated data, and this zero-shot approach is on par with existing methods when applied to 50 WGS samples from the Cancer Genome Atlas. Notably, our approach requires only a tumour sample and not a matched normal sample, has fewer markers of overfitting, and performs inference in only a few minutes. araCNA demonstrates how domain knowledge can be used to simulate training sets that harness the power of modern machine learning in biological applications.

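The abstract notes that araCNA is trained only on simulated data built from domain knowledge. As a rough illustration of what such a simulation can look like, the sketch below generates noisy read depth and B-allele frequency (BAF) values from a known allele-specific copy number profile, using the standard tumour/normal mixture relationships. It is not the authors' simulator: the function names, the Gaussian noise model, the diploid normalisation, and all default parameters are assumptions made for this example.

```python
import random


def expected_signals(major, minor, purity):
    """Expected depth ratio and BAF for one segment (hypothetical helper).

    Assumes a mixture of tumour cells (fraction `purity`) carrying
    `major` + `minor` copies and normal cells carrying 1 + 1 copies.
    The depth ratio is expressed relative to a diploid genome.
    """
    total = major + minor
    # Average number of copies per cell across the mixture.
    mix_total = purity * total + (1.0 - purity) * 2.0
    depth_ratio = mix_total / 2.0
    if mix_total == 0.0:
        # Homozygous deletion in a pure tumour: BAF is undefined,
        # so return 0.5 as a placeholder (an arbitrary choice).
        return depth_ratio, 0.5
    # Share of the mixture contributed by the minor allele.
    mix_minor = purity * minor + (1.0 - purity) * 1.0
    return depth_ratio, mix_minor / mix_total


def simulate_profile(segments, purity, base_depth=30.0,
                     depth_sd=3.0, baf_sd=0.05, seed=0):
    """Simulate per-site read depth, BAF and copy number labels.

    `segments` is a list of (n_sites, major, minor) tuples. Noise is
    Gaussian purely for simplicity; real data are noisier and biased.
    """
    rng = random.Random(seed)
    depths, bafs, labels = [], [], []
    for n_sites, major, minor in segments:
        ratio, baf = expected_signals(major, minor, purity)
        for _ in range(n_sites):
            # Clamp to the physically meaningful ranges.
            depths.append(max(0.0, rng.gauss(base_depth * ratio, depth_sd)))
            bafs.append(min(1.0, max(0.0, rng.gauss(baf, baf_sd))))
            labels.append((major, minor))
    return depths, bafs, labels


if __name__ == "__main__":
    # Normal segment, single-copy gain, copy-neutral loss of heterozygosity.
    profile = [(1000, 1, 1), (500, 2, 1), (800, 2, 0)]
    depths, bafs, labels = simulate_profile(profile, purity=0.7)
    print(len(depths), labels[0], labels[-1])
```

Because the labels are produced alongside the signals, each simulated genome comes with exact ground truth, which is what makes training on simulated data possible without annotated real samples.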