Talwar James V, Klie Adam, Pagadala Meghana S, Pasternak Gil, Rose Brent, Seibert Tyler M, Gymrek Melissa, Carter Hannah
Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA.
Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA.
medRxiv. 2025 May 18:2025.05.16.25327672. doi: 10.1101/2025.05.16.25327672.
Polygenic risk scores (PRSs) serve as quantitative metrics of genetic liability for various conditions. A PRS is traditionally calculated as an effect-size-weighted sum of genotypes, a formulation that assumes conditional feature independence and overlooks the potential for complex interactions among genetic variants. Transformers, a class of deep learning architectures known for capturing dependencies between features, have demonstrated remarkable predictive power across domains. In this work, we introduce VADEr, a Vision Transformer (ViT)-inspired architecture that combines techniques from both natural language processing and computer vision to capture properties exhibited by genetic data and to model local and global interactions for genotype-to-phenotype prediction. Evaluating VADEr's performance in predicting prostate cancer (PCa) risk, we found that across a range of metrics, including accuracy, average precision, and Matthews correlation coefficient, VADEr outperformed all benchmark methods, demonstrating its effectiveness in the context of complex disease risk prediction. To illuminate the drivers of disease risk identified by VADEr, we formulated DARTH scores, an attention-based attribution metric, to capture the personalized contribution of each genomic region. These scores revealed distinct genetic heterogeneity captured by VADEr, with drivers of predicted risk identified in key PCa risk regions including the , , and loci. DARTH scores also revealed germline predispositions for particular PCa molecular subtypes, including an association between the locus and the subtype, both implicated in the regulation of androgen receptor activity. Overall, by effectively capturing dependencies among genetic variants and providing interpretable insights, VADEr and DARTH scores offer a promising direction for advancing genotype-to-phenotype prediction, particularly in complex disease.
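For readers unfamiliar with the traditional formulation mentioned in the abstract, the following is a minimal sketch of an additive PRS, assuming each genotype is coded as a risk-allele dosage (0, 1, or 2) and each variant has a single per-allele effect size. The function name `polygenic_risk_score` and all numeric values are hypothetical and chosen only for illustration; this is not the VADEr model or any benchmark method from the study.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Additive PRS: sum over variants of effect size times allele dosage.

    dosages: risk-allele counts per variant (assumed coding: 0, 1, or 2).
    effect_sizes: per-allele effect sizes, one per variant (hypothetical values).
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    # Each variant contributes independently; no interaction terms are modeled.
    return sum(beta * g for beta, g in zip(effect_sizes, dosages))


# Made-up example with three variants for one individual.
effect_sizes = [0.5, -0.25, 1.0]
dosages = [2, 1, 0]
score = polygenic_risk_score(dosages, effect_sizes)
print(score)  # 0.75
```

Because every variant contributes a separate term to the sum, this form cannot represent interactions between variants, which is the limitation the abstract attributes to traditional PRSs.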