Matsen Frederick A, Sung Kevin, Johnson Mackenzie M, Dumm Will, Rich David, Starr Tyler N, Song Yun S, Bradley Philip, Fukuyama Julia, Haddox Hugh K
Computational Biology Program, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA.
Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA.
Mol Biol Evol. 2025 Jul 30;42(8). doi: 10.1093/molbev/msaf186.
During affinity maturation, antibodies are selected for their ability to fold and to bind a target antigen between rounds of somatic hypermutation. Previous studies have identified patterns of selection in antibodies using B cell repertoire sequencing data. However, these studies are constrained by needing to group many sequences or sites to make aggregate predictions. In this paper, we develop a transformer-encoder selection model of maximum resolution: given a single antibody sequence, it predicts the strength of selection on each amino acid site. Specifically, the model predicts for each site whether evolution will be slower than expected relative to a model of the neutral mutation process (purifying selection) or faster than expected (diversifying selection). We show that the model does an excellent job of modeling the process of natural selection on held-out data, and does not need to be enormous or trained on vast amounts of data to perform well. The patterns of purifying vs diversifying natural selection do not neatly partition into the complementarity-determining vs framework regions: for example, there are many sites in the framework regions that experience strong diversifying selection. There is a weak correlation between selection factors and solvent accessibility. When considering evolutionary shifts down a tree of antibody evolution, affinity maturation generally shifts sites towards purifying natural selection; however, this effect depends on the region, with the biggest shifts toward purifying selection happening in the third complementarity-determining region. We observe distinct evolution between gene families but a limited relationship between germline diversity and selection strength.
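The distinction between purifying and diversifying selection described above can be illustrated with a minimal sketch. It assumes, for illustration only, that a per-site selection factor is the ratio of an observed substitution rate to the rate expected under a neutral mutation model, so that values below 1 indicate purifying selection and values above 1 indicate diversifying selection. The function name `classify_selection`, the `tolerance` band, and the input rates are hypothetical and are not taken from the paper, whose model predicts these quantities with a transformer encoder rather than computing them from observed rates.

```python
from typing import List, Sequence


def classify_selection(
    observed_rates: Sequence[float],
    neutral_rates: Sequence[float],
    tolerance: float = 0.1,
) -> List[str]:
    """Label each site by comparing its rate to the neutral expectation.

    Assumption (not from the paper): the selection factor at a site is
    observed_rate / neutral_rate, and `tolerance` defines a band around 1
    that is treated as near-neutral.
    """
    if len(observed_rates) != len(neutral_rates):
        raise ValueError("rate sequences must have the same length")

    labels = []
    for observed, neutral in zip(observed_rates, neutral_rates):
        if neutral <= 0:
            raise ValueError("neutral rates must be positive")
        factor = observed / neutral
        if factor < 1 - tolerance:
            labels.append("purifying")      # slower than neutral expectation
        elif factor > 1 + tolerance:
            labels.append("diversifying")   # faster than neutral expectation
        else:
            labels.append("near-neutral")
    return labels


# Hypothetical rates for three sites with the same neutral expectation.
example_labels = classify_selection([0.01, 0.05, 0.2], [0.05, 0.05, 0.05])
print(example_labels)
```

Under these assumptions, a site evolving at one fifth of the neutral rate is labeled purifying, and a site evolving at four times the neutral rate is labeled diversifying, whether it lies in a framework or a complementarity-determining region.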