Pereira Florbela
LAQV and REQUIMTE, Departamento de Química, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516, Caparica, Portugal.
Mol Inform. 2021 Jun;40(6):e2060034. doi: 10.1002/minf.202060034. Epub 2021 Mar 30.
In recent years there has been growing interest in the differences between the chemical and biological spaces represented by natural products (NPs) of terrestrial and marine origin. To learn more about these two chemical spaces, marine natural products (MNPs) and terrestrial natural products (TNPs), this work developed a machine learning (ML) approach to predict three classes: MNPs, TNPs, and a third class of NPs that occur in both terrestrial and marine environments. In total, 22,398 NPs were retrieved from the Reaxys® database; of these, 10,790 molecules are recorded as MNPs, 10,857 as TNPs, and 761 as both MNPs and TNPs. Several ML algorithms, including Random Forest, Support Vector Machines, and deep learning Multilayer Perceptron networks, were benchmarked. The best performance was achieved with a consensus classification model, which predicted the external test set with an overall accuracy of up to 81%. As far as we know, this approach has not been attempted before. It can be used to better understand the chemical space defined by MNPs, TNPs, or both, and also in virtual screening to define the applicability domain of QSAR models of MNPs and TNPs.
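The consensus step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the individual models were combined, so the majority vote, the tie-break rule (fall back to the first listed model), and the helper names `consensus_vote` and `accuracy` are all assumptions made for illustration. The label lists are toy values, not results from the study.

```python
from collections import Counter


def consensus_vote(per_model_labels):
    """Combine per-model class labels by majority vote, one molecule at a time.

    Assumption: ties are resolved in favour of the earliest model in the list.
    """
    consensus = []
    for votes in zip(*per_model_labels):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first vote, in model order, that reaches the top count.
        winner = next(v for v in votes if counts[v] == top)
        consensus.append(winner)
    return consensus


def accuracy(predicted, actual):
    """Fraction of molecules whose predicted class matches the recorded class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy labels for five hypothetical molecules (illustrative only).
true_labels = ["MNP", "TNP", "both", "MNP", "TNP"]
rf_pred = ["MNP", "TNP", "MNP", "TNP", "TNP"]    # stand-in for a Random Forest
svm_pred = ["MNP", "MNP", "both", "both", "TNP"]  # stand-in for an SVM
mlp_pred = ["TNP", "TNP", "both", "MNP", "TNP"]   # stand-in for an MLP

consensus = consensus_vote([rf_pred, svm_pred, mlp_pred])
overall_accuracy = accuracy(consensus, true_labels)
print(consensus, overall_accuracy)
```

In this toy case the fourth molecule receives three different votes, so the tie-break decides its class. A different rule, such as averaging class probabilities, would be an equally plausible reading of "consensus".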