Visscher Ellen, Yau Christopher
Nuffield Department for Women's & Reproductive Health, University of Oxford, Women's Centre, John Radcliffe Hospital, Oxford OX3 9DU, United Kingdom.
NAR Genom Bioinform. 2025 Sep 9;7(3):lqaf124. doi: 10.1093/nargab/lqaf124. eCollection 2025 Sep.
Somatic copy number alterations (CNAs) are hallmarks of cancer. Current algorithms that call CNAs from whole-genome sequencing (WGS) data have not exploited deep learning methods owing to computational scaling limitations. Here, we present a novel deep-learning approach, araCNA, trained only on simulated data, that can accurately predict CNAs in real WGS cancer genomes. araCNA uses novel transformer alternatives (e.g. Mamba) to handle genomic-scale sequence lengths (~1M) and learn long-range interactions. Results are extremely accurate on simulated data, and this zero-shot approach is on par with existing methods when applied to 50 WGS samples from The Cancer Genome Atlas. Notably, our approach requires only a tumour sample and not a matched normal sample, shows fewer markers of overfitting, and performs inference in only a few minutes. araCNA demonstrates how domain knowledge can be used to simulate training sets that harness the power of modern machine learning in biological applications.
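The abstract does not describe how the training data are simulated, so the following is only an illustrative sketch of the general idea, not araCNA's actual simulator. It assumes the standard purity mixture model for expected read depth and B-allele frequency (BAF) at heterozygous sites, with Gaussian noise added for simplicity. All function names, parameters, and default values (`expected_signal`, `simulate_profile`, `depth_per_copy`, `depth_sd`, `baf_sd`, and so on) are hypothetical.

```python
import random


def expected_signal(total_cn, minor_cn, purity, depth_per_copy):
    """Expected read depth and BAF for one segment (assumed mixture model)."""
    # Average copies per cell: tumour cells carry total_cn, normal cells carry 2.
    mean_copies = purity * total_cn + (1.0 - purity) * 2.0
    depth = depth_per_copy * mean_copies
    # Copies carrying the B allele: minor_cn in tumour cells, 1 in normal cells.
    b_copies = purity * minor_cn + (1.0 - purity) * 1.0
    # With no copies at all (pure tumour, homozygous deletion) BAF is undefined;
    # 0.5 is used here as an arbitrary placeholder.
    baf = b_copies / mean_copies if mean_copies > 0 else 0.5
    return depth, baf


def simulate_profile(n_markers, n_segments, purity, depth_per_copy=15.0,
                     max_cn=6, depth_sd=2.0, baf_sd=0.03, seed=0):
    """Simulate noisy depth/BAF tracks plus per-marker copy number labels."""
    rng = random.Random(seed)
    # Random breakpoints split the markers into contiguous segments.
    cuts = sorted(rng.sample(range(1, n_markers), n_segments - 1))
    bounds = [0] + cuts + [n_markers]
    depths, bafs, labels = [], [], []
    for start, end in zip(bounds[:-1], bounds[1:]):
        total = rng.randint(0, max_cn)       # total copy number of the segment
        minor = rng.randint(0, total // 2)   # minor-allele copy number
        mu_depth, mu_baf = expected_signal(total, minor, purity, depth_per_copy)
        for _ in range(end - start):
            # Gaussian noise is a simplifying assumption; values are clipped
            # to their valid ranges.
            depths.append(max(0.0, rng.gauss(mu_depth, depth_sd)))
            bafs.append(min(1.0, max(0.0, rng.gauss(mu_baf, baf_sd))))
            labels.append((total, minor))
    return depths, bafs, labels


if __name__ == "__main__":
    depths, bafs, labels = simulate_profile(n_markers=1000, n_segments=5, purity=0.8)
    print(len(depths), labels[0], labels[-1])
```

Because the copy number labels are generated alongside the signal, each simulated profile is a fully supervised training example. This is what makes training without real annotated genomes possible in principle.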