Center for Bioinformatics (ZBH), Department of Informatics, Faculty of Mathematics, Informatics and Natural Sciences, Universität Hamburg, 20146 Hamburg, Germany.
Department of Chemistry, University of Bergen, 5007 Bergen, Norway.
Biomolecules. 2019 Jan 24;9(2):43. doi: 10.3390/biom9020043.
Natural products (NPs) remain the most prolific resource for the development of small-molecule drugs. Here we report a new machine learning approach that identifies natural products with high accuracy. The method also generates similarity maps, which highlight the atoms that contribute significantly to the classification of a small molecule as a natural product or a synthetic molecule. The method can hence be used to (i) identify natural products in large molecular libraries, (ii) quantify the natural product-likeness of small molecules, and (iii) visualize atoms in small molecules that are characteristic of natural products or synthetic molecules. The models are random forest classifiers trained on data sets of approximately 265,000 to 322,000 natural products and synthetic molecules. Two-dimensional molecular descriptors, MACCS keys, and Morgan2 fingerprints were explored as molecular representations. On an independent test set, the models reached areas under the receiver operating characteristic curve (AUC) of 0.997 and Matthews correlation coefficients (MCCs) of 0.954 and higher. The method was further tested on data from the Dictionary of Natural Products, ChEMBL, and other resources. The best-performing models are accessible as a free web service at http://npscout.zbh.uni-hamburg.de/npscout.
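For readers unfamiliar with the two performance measures quoted above, the following minimal sketch shows how AUC and MCC can be computed for a binary natural product versus synthetic molecule classifier. It is an illustration only, not the authors' evaluation code: the function names, the 0.5 decision threshold, and the six-molecule toy data are assumptions introduced here, and the scores do not come from the paper.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC from confusion-matrix counts; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive has the higher score; tied pairs count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_counts(labels, scores, threshold=0.5):
    """Count (tp, fp, tn, fn), predicting class 1 when score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Hypothetical toy data: 1 = natural product, 0 = synthetic molecule.
# The scores stand in for a classifier's predicted class-1 probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

auc = roc_auc(labels, scores)
mcc = matthews_corrcoef(*confusion_counts(labels, scores))
```

AUC depends only on how the scores rank the molecules, whereas MCC depends on the chosen decision threshold and accounts for all four cells of the confusion matrix, which is why the two measures are commonly reported together.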