Department of Genetics.
Department of Bacteriology and J. F. Crow Institute for the Study of Evolution, University of Wisconsin-Madison, Madison, WI 53706, USA.
Bioinformatics. 2017 Oct 15;33(20):3202-3210. doi: 10.1093/bioinformatics/btx400.
Nonribosomally synthesized peptides (NRPs) are natural products with widespread applications in medicine and biotechnology. Many algorithms have been developed to predict the substrate specificities of nonribosomal peptide synthetase adenylation (A) domains from DNA sequences, enabling prioritization, dereplication and integration with other data types in discovery efforts. However, insufficient training data and a lack of clarity regarding prediction quality have impeded their optimal use. Here, we introduce prediCAT, a new phylogenetics-inspired algorithm, which quantitatively estimates the degree of predictability of each A-domain. We then systematically benchmark all algorithms on a newly gathered, independent test set of 434 A-domain sequences, showing that active-site-motif-based algorithms outperform whole-domain-based methods. Subsequently, we develop SANDPUMA, a powerful ensemble algorithm based on newly trained versions of all high-performing algorithms, which significantly outperforms the individual methods. Finally, we deploy SANDPUMA in a systematic investigation of 7635 Actinobacteria genomes, the results of which suggest that NRP chemical diversity is much higher than previously estimated. SANDPUMA has been integrated into the widely used antiSMASH biosynthetic gene cluster analysis pipeline and is also available as an open-source, standalone tool.
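The ensemble idea can be illustrated with a minimal sketch. It assumes a simple plurality vote over per-algorithm substrate calls; the abstract does not describe how SANDPUMA actually combines its component predictors, so this is a generic stand-in rather than the published method. The function name `ensemble_call`, the `min_agreement` threshold, the `"no_call"` label and the method names in the example are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def ensemble_call(
    predictions: Dict[str, Optional[str]],
    min_agreement: float = 0.5,
) -> str:
    """Combine per-algorithm substrate calls for one A-domain by plurality vote.

    `predictions` maps a (hypothetical) algorithm name to its predicted
    substrate, or None if that algorithm abstained. This is an assumed
    voting scheme, not SANDPUMA's actual combination rule.
    """
    # Ignore algorithms that made no prediction for this domain.
    votes = [call for call in predictions.values() if call is not None]
    if not votes:
        return "no_call"

    # Find the most frequently predicted substrate and its vote count.
    top_substrate, top_count = Counter(votes).most_common(1)[0]

    # Accept the call only if strictly more than `min_agreement` of the
    # voting algorithms agree; otherwise report no confident call.
    if top_count / len(votes) > min_agreement:
        return top_substrate
    return "no_call"


if __name__ == "__main__":
    example = {"method_a": "ser", "method_b": "ser", "method_c": "thr"}
    print(ensemble_call(example))
```

Returning an explicit `"no_call"` when the predictors disagree mirrors the abstract's concern with prediction quality: an uncertain domain is flagged rather than assigned a substrate by default.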
SANDPUMA is freely available at https://bitbucket.org/chevrm/sandpuma and as a Docker image at https://hub.docker.com/r/chevrm/sandpuma/ under the GNU General Public License version 3 (GPLv3).
Contact: chevrette@wisc.edu or marnix.medema@wur.nl.
Supplementary data are available at Bioinformatics online.