Badet Thomas, Fouché Simone, Hartmann Fanny E, Zala Marcello, Croll Daniel
Laboratory of Evolutionary Genetics, Institute of Biology, University of Neuchâtel, Neuchâtel, Switzerland.
Plant Pathology, Institute of Integrative Biology, ETH Zurich, Zurich, Switzerland.
Nat Commun. 2021 Jun 10;12(1):3551. doi: 10.1038/s41467-021-23862-x.
Species harbor extensive structural variation underpinning recent adaptive evolution. However, the causality between genomic features and the induction of new rearrangements is poorly established. Here, we analyze a global set of telomere-to-telomere genome assemblies of a fungal pathogen of wheat to establish a nucleotide-level map of structural variation. We show that the recent emergence of pesticide resistance has been disproportionally driven by rearrangements. We use machine learning to train a model on structural variation events based on 30 chromosomal sequence features. We show that base composition and gene density are the major determinants of structural variation. Retrotransposons explain most inversion, indel and duplication events. We apply our model to Arabidopsis thaliana and show that our approach extends to more complex genomes. Finally, we analyze complete genomes of haploid offspring in a four-generation pedigree. Meiotic crossover locations are enriched for new rearrangements consistent with crossovers being mutational hotspots. The model trained on species-wide structural variation accurately predicts the position of >74% of newly generated variants along the pedigree. The predictive power highlights causality between specific sequence features and the induction of chromosomal rearrangements. Our work demonstrates that training sequence-derived models can accurately identify regions of intrinsic DNA instability in eukaryotic genomes.
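As a minimal illustration of what "chromosomal sequence features" could look like in practice, the sketch below computes two of the feature types the abstract names as major determinants (base composition and gene density) over fixed-size windows. The function names, the 1 kb default window, and the half-open interval convention are assumptions made for illustration. The abstract does not specify how the 30 features were computed or which learning algorithm was used.

```python
from typing import Dict, List, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous A/C/G/T bases (0.0 if there are none)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def gene_density(start: int, end: int, genes: List[Tuple[int, int]]) -> float:
    """Fraction of the half-open window [start, end) covered by genes.

    Assumption: gene intervals are half-open and do not overlap each other.
    """
    covered = sum(
        max(0, min(end, g_end) - max(start, g_start)) for g_start, g_end in genes
    )
    return covered / (end - start)


def window_features(
    seq: str, genes: List[Tuple[int, int]], window: int = 1000
) -> List[Dict[str, float]]:
    """Split a chromosome sequence into windows and compute per-window features."""
    rows = []
    for start in range(0, len(seq), window):
        end = min(start + window, len(seq))
        rows.append(
            {
                "start": start,
                "end": end,
                "gc": gc_content(seq[start:end]),
                "gene_density": gene_density(start, end, genes),
            }
        )
    return rows


# Toy example (synthetic sequence, not real data): two 4 bp windows.
toy_rows = window_features("GGCCATAT", genes=[(2, 6)], window=4)
```

In a workflow of the kind the abstract describes, rows like these would be paired with a label indicating whether a structural variant overlaps the window and then passed to a classifier of the reader's choice.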