School of Life Science and State Key Laboratory of Agrobiotechnology, G94, Science Center South Block, The Chinese University of Hong Kong, Shatin 999077, Hong Kong.
Molecules. 2019 Jun 30;24(13):2414. doi: 10.3390/molecules24132414.
Machine learning plays an important role in ligand-based virtual screening. However, conventional machine learning approaches tend to be inefficient on such problems, where the data are imbalanced and the features describing the chemical characteristics of ligands are high-dimensional. Here we describe LBS (local beta screening), a machine learning algorithm for ligand-based virtual screening. The distinguishing characteristic of LBS is that it quantifies the generalization ability of screening directly through a refined loss function. It can therefore assess the risk of over-fitting accurately and efficiently for imbalanced, high-dimensional data without relying on resampling methods such as cross-validation. The robustness of LBS was demonstrated in a simulation study and in tests on real datasets, in which LBS outperformed conventional algorithms in both screening accuracy and model interpretation. LBS was then used to screen an independent compound library for potential activators of HIV-1 integrase multimerization, and the virtual screening result was validated experimentally. Of the 25 compounds tested, six proved to be active. The most potent compound in the experimental validation showed an EC50 value of 0.71 µM.