Martí-Gómez Carlos, Zhou Juannan, Chen Wei-Chia, Kinney Justin B, McCandlish David M
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724.
Department of Biology, University of Florida, Gainesville, FL, 32611.
bioRxiv. 2025 Mar 15:2025.03.09.642267. doi: 10.1101/2025.03.09.642267.
Multiplex assays of variant effect (MAVEs) allow the functional characterization of an unprecedented number of sequence variants in both gene regulatory regions and protein coding sequences. This has enabled the study of nearly complete combinatorial libraries of mutational variants and revealed the widespread influence of higher-order genetic interactions that arise when multiple mutations are combined. However, the lack of appropriate tools for exploratory analysis of these high-dimensional data limits our understanding of the main qualitative properties of complex genotype-phenotype maps. To fill this gap, we have developed gpmap-tools (https://github.com/cmarti/gpmap-tools), a library that integrates Gaussian process models for inference, phenotypic imputation, and error estimation from incomplete and noisy MAVE data and from collections of natural sequences, together with methods for summarizing patterns of higher-order epistasis and non-linear dimensionality reduction techniques that allow visualization of genotype-phenotype maps containing up to millions of genotypes. Here, we used gpmap-tools to study the genotype-phenotype map of the Shine-Dalgarno sequence, a motif that modulates binding of the 16S rRNA to the 5' untranslated region (UTR) of mRNAs through base pair complementarity during translation initiation in prokaryotes. We inferred full combinatorial landscapes containing 262,144 different sequences from the sequences of 5,311 5'UTRs in the genome and from experimental MAVE data. Visualizations of the inferred landscapes were largely consistent with each other, and unveiled a simple molecular mechanism underlying the highly epistatic genotype-phenotype map of the Shine-Dalgarno sequence.
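To give a sense of the size of a full combinatorial landscape, the sketch below enumerates every sequence of a fixed length over the four RNA bases and lists the single-mutant neighbors of one sequence. It does not use gpmap-tools; the function names are hypothetical, and the sequence length of 9 is an assumption inferred only from the fact that 4^9 = 262,144 matches the number of sequences reported above.

```python
from itertools import product

ALPHABET = "ACGU"  # the four RNA bases
LENGTH = 9         # assumption: inferred from 4**9 == 262144, not stated in the abstract


def enumerate_genotypes(alphabet=ALPHABET, length=LENGTH):
    """Return every sequence of the given length over the alphabet."""
    return ["".join(bases) for bases in product(alphabet, repeat=length)]


def single_mutant_neighbors(seq, alphabet=ALPHABET):
    """Return all sequences that differ from seq at exactly one position."""
    neighbors = []
    for i, base in enumerate(seq):
        for alt in alphabet:
            if alt != base:
                neighbors.append(seq[:i] + alt + seq[i + 1:])
    return neighbors


genotypes = enumerate_genotypes()
print(len(genotypes))  # size of the full combinatorial sequence space

example = "AGGAGGUAA"  # arbitrary illustrative 9-mer, not taken from the paper's data
print(len(single_mutant_neighbors(example)))  # (4 - 1) * 9 single-mutant neighbors
```

Under these assumptions, each genotype has 27 single-mutant neighbors, so the landscape can be viewed as a graph with 262,144 nodes in which edges connect sequences that differ by one substitution.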