Bioinformatics Group, Wageningen University, Wageningen, the Netherlands.
Netherlands eScience Center, Amsterdam, the Netherlands.
PLoS Comput Biol. 2023 Feb 9;19(2):e1010462. doi: 10.1371/journal.pcbi.1010462. eCollection 2023 Feb.
Microbial specialised metabolism is full of valuable natural products that are applied clinically, agriculturally, and industrially. The genes that encode their biosynthesis are often physically clustered on the genome in biosynthetic gene clusters (BGCs). Many BGCs consist of multiple groups of co-evolving genes called sub-clusters, each of which is responsible for the biosynthesis of a specific chemical moiety in a natural product. Sub-clusters therefore provide an important link between the structure of a natural product and its BGC, which can be leveraged for predicting natural product structures from sequence, as well as for linking chemical structures and metabolomics-derived mass features to BGCs. While some initial computational methodologies have been devised for sub-cluster detection, current approaches are not scalable, have only been run on small and outdated datasets, or produce an impractically large number of possible sub-clusters to mine through. Here, we constructed a scalable method for unsupervised sub-cluster detection, called iPRESTO, based on topic modelling and statistical analysis of co-occurrence patterns of enzyme-coding protein families. iPRESTO was used to mine sub-clusters across 150,000 prokaryotic BGCs from antiSMASH-DB. After annotating a fraction of the resulting sub-cluster families, we could predict a substructure for 16% of the antiSMASH-DB BGCs. Additionally, our method was able to confirm 83% of the experimentally characterised sub-clusters in MIBiG reference BGCs. Based on iPRESTO-detected sub-clusters, we could correctly identify the BGCs for xenorhabdin and salbostatin biosynthesis (which had not yet been annotated in BGC databases), as well as propose a candidate BGC for akashin biosynthesis. Finally, we show for a collection of 145 actinobacteria how substructures can aid in linking BGCs to molecules by correlating iPRESTO-detected sub-clusters to MS/MS-derived Mass2Motifs substructure patterns.
This work paves the way for deeper functional and structural annotation of microbial BGCs by improved linking of orphan molecules to their cognate gene clusters, thus facilitating accelerated natural product discovery.
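To make the idea of "statistical analysis of co-occurrence patterns" of protein families concrete, the following is a minimal sketch and not iPRESTO's actual implementation or API. It assumes each BGC is reduced to a set of protein-family labels and uses a one-sided hypergeometric test as a stand-in for whatever statistic the method applies. All names (`hypergeom_sf`, `cooccurring_pairs`, `toy_bgcs`, the `dom_*` labels) and the toy data are hypothetical, and no multiple-testing correction is applied.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, N, K, n):
    """Return P(X >= k) for X ~ Hypergeometric(N items, K marked, n drawn)."""
    upper = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


def cooccurring_pairs(bgcs, alpha=0.05):
    """Find protein-family pairs that co-occur across BGCs more than expected.

    bgcs maps a (hypothetical) BGC identifier to the protein families it encodes.
    Returns (family_a, family_b, shared_bgc_count, p_value) tuples, sorted by p-value.
    """
    domain_sets = [set(domains) for domains in bgcs.values()]
    n_bgcs = len(domain_sets)
    all_domains = sorted(set().union(*domain_sets))
    hits = []
    for a, b in combinations(all_domains, 2):
        n_a = sum(a in s for s in domain_sets)
        n_b = sum(b in s for s in domain_sets)
        n_ab = sum(a in s and b in s for s in domain_sets)
        # Probability of seeing at least n_ab shared BGCs if a and b were independent.
        p = hypergeom_sf(n_ab, n_bgcs, n_a, n_b)
        # Require more than one shared BGC so single coincidences are ignored.
        if n_ab > 1 and p < alpha:
            hits.append((a, b, n_ab, p))
    return sorted(hits, key=lambda hit: hit[3])


# Toy input (assumption): ten BGCs with made-up protein-family labels.
toy_bgcs = {
    "bgc01": ["dom_A", "dom_B", "dom_C"],
    "bgc02": ["dom_A", "dom_B"],
    "bgc03": ["dom_A", "dom_B", "dom_D"],
    "bgc04": ["dom_A", "dom_B", "dom_C"],
    "bgc05": ["dom_C", "dom_D"],
    "bgc06": ["dom_D"],
    "bgc07": ["dom_C"],
    "bgc08": ["dom_D", "dom_E"],
    "bgc09": ["dom_C", "dom_E"],
    "bgc10": ["dom_E"],
}

hits = cooccurring_pairs(toy_bgcs)
for family_a, family_b, shared, p_value in hits:
    print(f"{family_a} + {family_b}: {shared} shared BGCs, p = {p_value:.4f}")
```

In this toy set, `dom_A` and `dom_B` always appear together, so they form the kind of candidate sub-cluster such a test is meant to surface. A real analysis at the scale of 150,000 BGCs would need multiple-testing correction and a more efficient counting scheme.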