Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA.
Tatta Bio, Baltimore, MD, USA.
Nat Commun. 2024 Apr 3;15(1):2880. doi: 10.1038/s41467-024-46947-9.
Deciphering the relationship between a gene and its genomic context is fundamental to understanding and engineering biological systems. Machine learning has shown promise in learning latent relationships underlying the sequence-structure-function paradigm from massive protein sequence datasets. However, to date, few attempts have been made to extend this continuum to include higher-order genomic context information. Evolutionary processes dictate the specificity of genomic contexts in which a gene is found across phylogenetic distances, and these emergent genomic patterns can be leveraged to uncover functional relationships between gene products. Here, we train a genomic language model (gLM) on millions of metagenomic scaffolds to learn the latent functional and regulatory relationships between genes. gLM learns contextualized protein embeddings that capture the genomic context as well as the protein sequence itself, and that encode biologically meaningful and functionally relevant information (e.g. enzymatic function, taxonomy). Our analysis of the attention patterns demonstrates that gLM learns co-regulated functional modules (i.e. operons). Our findings illustrate that gLM's unsupervised deep learning of the metagenomic corpus is an effective and promising approach to encoding the functional semantics and regulatory syntax of genes in their genomic contexts and to uncovering complex relationships between genes in a genomic region.