Classical scoring functions for docking are unable to exploit large volumes of structural and interaction data.

Affiliations

SDIVF R&D Centre, Hong Kong Science Park, Sha Tin, New Territories, Hong Kong.

CUHK-SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong.

Publication Information

Bioinformatics. 2019 Oct 15;35(20):3989-3995. doi: 10.1093/bioinformatics/btz183.

Abstract

MOTIVATION

Studies have shown that the accuracy of random forest (RF)-based scoring functions (SFs), such as RF-Score-v3, increases with more training samples, whereas that of classical SFs, such as X-Score, does not. Nevertheless, the impact of the similarity between training and test samples on this matter has not been studied in a systematic manner. It is therefore unclear how these SFs would perform when only trained on protein-ligand complexes that are highly dissimilar or highly similar to the test set. It is also unclear whether SFs based on machine learning algorithms other than RF can also improve accuracy with increasing training set size and to what extent they learn from dissimilar or similar training complexes.

RESULTS

We present a systematic study to investigate how the accuracy of classical and machine-learning SFs varies with the protein-ligand complex similarity between training and test sets. We considered three types of similarity metrics, based on the comparison of either protein structures, protein sequences or ligand structures. Regardless of the similarity metric, we found that incorporating a larger proportion of similar complexes into the training set did not make classical SFs more accurate. In contrast, RF-Score-v3 was able to outperform X-Score even when trained on just the 32% of complexes most dissimilar to the test set, showing that its superior performance owes considerably to learning from training complexes that are dissimilar to those in the test set. In addition, we generated the first SF employing Extreme Gradient Boosting (XGBoost), XGB-Score, and observed that it also improves with training set size while outperforming the other SFs. Given the continuous growth of training datasets, the development of machine-learning SFs has become very appealing.
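The snippet below is a minimal illustrative sketch, not the authors' implementation: it selects the training complexes whose ligands are most dissimilar to the test-set ligands (one of the three similarity metrics considered) and fits a gradient-boosted tree regressor in the spirit of XGB-Score. The SMILES strings, descriptor matrix and affinities are placeholders; in the actual study they would be derived from real protein-ligand complexes and measured binding affinities.

```python
# Hedged sketch: ligand-similarity-based training-set selection followed by an
# XGBoost regressor used as a scoring function. All data below are placeholders.
import numpy as np
import xgboost as xgb
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles, radius=2, n_bits=2048):
    """Morgan fingerprint as a simple ligand-structure descriptor."""
    return AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles), radius, nBits=n_bits)

# Hypothetical ligands of the training and test complexes.
train_smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1",
                "CCN(CC)CC", "c1ccc2[nH]ccc2c1", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
test_smiles = ["c1ccccc1C(=O)O"]

train_fps = [morgan_fp(s) for s in train_smiles]
test_fps = [morgan_fp(s) for s in test_smiles]

# For each training ligand, record its maximum Tanimoto similarity to any
# test ligand; lower values mean "more dissimilar to the test set".
max_sim = np.array([max(DataStructs.BulkTanimotoSimilarity(fp, test_fps))
                    for fp in train_fps])

# Keep the most dissimilar 32% of training complexes (the smallest cut
# mentioned in the abstract), with a floor of two samples for the toy data.
keep = max_sim.argsort()[: max(2, int(0.32 * len(train_smiles)))]

# Placeholder per-complex descriptors (e.g. intermolecular contact counts)
# and binding affinities; real values would come from the complexes themselves.
rng = np.random.default_rng(0)
X_train = rng.random((len(train_smiles), 36))
y_train = rng.random(len(train_smiles)) * 10
X_test = rng.random((len(test_smiles), 36))

# Gradient-boosted trees acting as the scoring function (an XGB-Score-like model).
model = xgb.XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X_train[keep], y_train[keep])
print("Predicted affinity:", model.predict(X_test))
```

The same selection logic applies to the other two similarity metrics in the study; only the pairwise similarity computation (protein structure or sequence comparison instead of ligand fingerprints) would change.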

AVAILABILITY AND IMPLEMENTATION

https://github.com/HongjianLi/MLSF.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
