Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States.
Institute for Protein Design, University of Washington, Seattle, WA 98195, United States.
Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i40-i46. doi: 10.1093/bioinformatics/btad235.
Microbial natural products are a major source of bioactive compounds for drug discovery. Among these molecules, nonribosomal peptides (NRPs) form a diverse class that includes antibiotics, immunosuppressants, anticancer agents, toxins, siderophores, pigments, and cytostatics. The discovery of novel NRPs remains a laborious process because many NRPs consist of nonstandard amino acids that are assembled by nonribosomal peptide synthetases (NRPSs). Adenylation domains (A-domains) in NRPSs are responsible for the selection and activation of the monomers appearing in NRPs. During the past decade, several support vector machine-based algorithms have been developed for predicting the specificity of the monomers present in NRPs. These algorithms utilize physicochemical features of the amino acids present in the A-domains of NRPSs. In this article, we benchmark the performance of various machine learning algorithms and features for predicting the specificities of NRPSs, and we show that the extra trees model paired with one-hot encoding features outperforms the existing approaches. Moreover, we show that unsupervised clustering of 453 560 A-domains reveals many clusters that correspond to potentially novel amino acids. While it is challenging to predict the chemical structure of these amino acids, we develop novel techniques to predict their various properties, including polarity, hydrophobicity, charge, and the presence of aromatic rings, carboxyl groups, and hydroxyl groups.
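To make the one-hot encoding features mentioned in the abstract concrete, the following minimal sketch turns an A-domain residue signature into a binary feature vector. It is an illustration under stated assumptions, not the authors' implementation: the alphabet ordering, the gap symbol, the handling of unknown symbols, the function name `one_hot_encode`, and the example signatures are all hypothetical.

```python
from typing import List

# Assumption: 20 standard residues plus a gap symbol, in an arbitrary fixed order.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot_encode(signature: str) -> List[int]:
    """Return a flat 0/1 vector with one block of len(ALPHABET) per residue."""
    index = {residue: i for i, residue in enumerate(ALPHABET)}
    vector: List[int] = []
    for residue in signature.upper():
        block = [0] * len(ALPHABET)
        # Assumption: symbols outside the alphabet are mapped to the gap column.
        block[index.get(residue, index["-"])] = 1
        vector.extend(block)
    return vector


# Placeholder signatures for illustration only; they are not data from the paper.
signatures = ["DAVTIGAICK", "DVWHLSLIDK"]
features = [one_hot_encode(s) for s in signatures]

# Each signature of length n becomes a vector of n * len(ALPHABET) entries,
# with exactly one 1 per residue position.
print(len(features[0]), sum(features[0]))
```

Vectors of this form could then be passed, together with known substrate labels, to a tree-ensemble classifier such as scikit-learn's `ExtraTreesClassifier`. Whether the authors used that library is not stated in the abstract, so that pairing is an assumption.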