Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara, 630-0192, Japan.
Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara, 630-0192, Japan.
J Comput Aided Mol Des. 2022 Mar;36(3):237-252. doi: 10.1007/s10822-022-00449-2. Epub 2022 Mar 29.
The retrospective evaluation of virtual screening approaches and activity prediction models is important for methodological development. For a fair comparison, however, evaluation data sets must be carefully prepared. In this research, we compiled structure-activity-relationship matrix-based data sets for 15 biological targets along with many diverse inactive compounds, assuming the early stage of structure-activity-relationship progression. To make use of a large number of diverse inactive compounds and a limited number of active compounds, similarity profiles (SPs) are proposed as a set of molecular descriptors. Using these highly imbalanced data sets, we evaluated various approaches, including SPs, under-sampling, support vector machines (SVMs), and message passing neural networks. We found that, among the under-sampling approaches, cluster-based sampling is better than random sampling. For virtual screening, SPs with inactive reference compounds and the under-sampling SVM also performed well. For classification, SPs with many inactive references performed as well as the under-sampling SVM trained on a balanced data set. Although the performance of SPs and the under-sampling SVM was comparable, SPs with many inactive references were preferable for selecting compounds structurally distinct from the active training compounds.
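The abstract does not spell out how a similarity profile is computed, so the following is only a minimal sketch under one plausible reading: a compound is described by the vector of its Tanimoto similarities to a fixed list of reference compounds, active and inactive. The fingerprint representation (sets of on-bit indices), the function names `tanimoto` and `similarity_profile`, and the toy data are all assumptions for illustration, not the authors' implementation.

```python
from typing import FrozenSet, List, Sequence

# Assumption: a binary fingerprint is stored as the set of its "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_profile(
    query: Fingerprint, references: Sequence[Fingerprint]
) -> List[float]:
    """Describe `query` by its similarity to each reference compound.

    The result has one entry per reference, so adding more inactive
    references lengthens the descriptor instead of adding training rows.
    """
    return [tanimoto(query, ref) for ref in references]


# Hypothetical toy fingerprints: two active and two inactive references.
active_refs = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 4, 5})]
inactive_refs = [frozenset({10, 11, 12}), frozenset({1, 10, 20, 30})]
references = active_refs + inactive_refs

query = frozenset({1, 2, 3, 5})
profile = similarity_profile(query, references)
print(profile)
```

Under this reading, the many inactive compounds enter the model as descriptor dimensions, which is one way a large inactive set could be used without under-sampling it; a downstream classifier or ranking function would then operate on `profile`.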