A machine learning-based method for prediction of macrocyclization patterns of polyketides and non-ribosomal peptides.

Affiliation

Bioinformatics Centre, National Institute of Immunology, New Delhi 110067, India.

Publication information

Bioinformatics. 2021 May 5;37(5):603-611. doi: 10.1093/bioinformatics/btaa851.

DOI:10.1093/bioinformatics/btaa851

PMID:33010151

Abstract

MOTIVATION

Even though genome mining tools have successfully identified large numbers of non-ribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) biosynthetic gene clusters (BGCs) in bacterial genomes, currently no tool can predict the chemical structure of the secondary metabolites biosynthesized by these BGCs. Lack of algorithms for predicting complex macrocyclization patterns of linear PK/NRP biosynthetic intermediates has been the major bottleneck in deciphering the final bioactive chemical structures of PKs/NRPs by genome mining.

RESULTS

Using a large dataset of known chemical structures of macrocyclized PKs/NRPs, we have developed a machine learning (ML) algorithm that distinguishes the correct macrocyclization pattern of a PK/NRP from the library of all theoretically possible cyclization patterns. Benchmarking of this ML classifier on completely independent datasets yielded ROC-AUC and PR-AUC values of 0.82 and 0.81, respectively. This cyclization prediction algorithm has been used to develop SBSPKSv3, a genome mining tool for completely automated prediction of macrocyclized structures of NRPs/PKs. SBSPKSv3 has been extensively benchmarked on a dataset of over 100 BGCs with known PK/NRP products.
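For readers unfamiliar with the two benchmark metrics, the following is a minimal sketch of how ROC-AUC and PR-AUC can be computed for a classifier that scores candidate cyclization patterns. It is not the authors' code: the function names, the toy labels and the toy scores are hypothetical, and PR-AUC is approximated here by average precision, which may differ from the estimator used in the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive candidate receives the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, a step-wise estimate of PR-AUC.
    Assumption: scores are untied, so the ranking is unambiguous."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_pos


# Hypothetical toy data: 1 marks a correct cyclization pattern, 0 a decoy
# drawn from the theoretically possible alternatives.
labels = [1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]

print(f"ROC-AUC: {roc_auc(labels, scores):.3f}")
print(f"PR-AUC (average precision): {average_precision(labels, scores):.3f}")
```

Because decoy patterns typically far outnumber correct ones in this kind of setting, PR-AUC is a useful complement to ROC-AUC: it is sensitive to how many decoys are ranked above the correct pattern.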

AVAILABILITY AND IMPLEMENTATION

The macrocyclization prediction pipeline and all the datasets used in this study are freely available at http://www.nii.ac.in/sbspks3.html.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
