Microsoft Research New England, Cambridge, Massachusetts, United States of America.
Department of Bioengineering, Stanford University, Stanford, California, United States of America.
PLoS Comput Biol. 2023 May 23;19(5):e1011162. doi: 10.1371/journal.pcbi.1011162. eCollection 2023 May.
Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). Advances in high-throughput sequencing have produced a growing number of complete microbial isolate genomes and metagenomes, which contain a vast number of undiscovered BGCs. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification.
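To make the domain-chain representation concrete, the following is a minimal sketch of how a masked-language-model training example might be built from a BGC written as a sequence of protein domain identifiers. It is not the authors' implementation: the function name `mask_domains`, the 15% masking rate (borrowed from common BERT-style practice), the `[MASK]` token, and the Pfam-style identifiers in the example are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def mask_domains(domains, mask_rate=0.15, seed=0):
    """Hide a fraction of domain tokens and return (masked sequence, targets).

    `targets` maps each masked position to the original domain, which is
    what a masked language model would be trained to predict.
    The 15% rate is an assumed default, not a value from the paper.
    """
    if not domains:
        return [], {}
    rng = random.Random(seed)  # seeded so the example is reproducible
    n_mask = max(1, round(len(domains) * mask_rate))  # mask at least one token
    positions = sorted(rng.sample(range(len(domains)), n_mask))
    masked = list(domains)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = MASK
    return masked, targets


# Illustrative BGC as a chain of Pfam-style domain IDs (assumed example data).
example_bgc = ["PF00109", "PF02801", "PF00698", "PF08659", "PF00550", "PF00975"]
masked, targets = mask_domains(example_bgc)
```

A model trained on such pairs sees the unmasked domains as context and is scored on recovering the hidden ones, which is what lets it learn domain and cluster representations without labeled BGCs.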