

The Impact of Protein Structure and Sequence Similarity on the Accuracy of Machine-Learning Scoring Functions for Binding Affinity Prediction.

Affiliations

SDIVF R&D Centre, Hong Kong Science Park, Sha Tin, New Territories, Hong Kong, China.

Institute of Future Cities, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China.

Publication information

Biomolecules. 2018 Mar 14;8(1):12. doi: 10.3390/biom8010012.

Abstract

It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes whose proteins are highly similar to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely used test set, and four SFs. Three of these SFs employ machine learning, whereas the fourth, X-Score, uses classical linear regression and has the best test set performance of 16 classical SFs. We have found that the random forest (RF)-based RF-Score-v3 outperforms X-Score even when 68% of the most similar proteins are removed from the training set. In addition, unlike X-Score, RF-Score-v3 keeps learning as the training set grows, becoming substantially more predictive than X-Score when the full set of 1105 complexes is used for training. These results show that machine-learning SFs owe a substantial part of their performance to training on complexes whose proteins are dissimilar to those in the test set, contrary to what was previously concluded from the same data. Given that a growing amount of structural and interaction data will become available from academic and industrial sources, this performance gap between machine-learning SFs and classical SFs is expected to widen in the future.
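The idea of similarity-based training sets can be illustrated with a minimal sketch. It assumes each training complex has already been assigned a single similarity value to its closest test-set protein. The function name, complex IDs, similarity values, and cutoffs below are all hypothetical, since the abstract does not specify the similarity measure or the cutoffs actually used.

```python
from typing import Dict, List, Sequence


def nested_training_sets(
    max_similarity: Dict[str, float],
    cutoffs: Sequence[float],
) -> Dict[float, List[str]]:
    """For each cutoff, keep the training complexes whose similarity to the
    closest test-set protein is at or below that cutoff."""
    return {
        cutoff: sorted(cid for cid, sim in max_similarity.items() if sim <= cutoff)
        for cutoff in cutoffs
    }


# Hypothetical complex IDs and similarity values, for illustration only.
toy_similarity = {"cplx_a": 0.25, "cplx_b": 0.55, "cplx_c": 0.80, "cplx_d": 1.00}
training_sets = nested_training_sets(toy_similarity, cutoffs=[0.3, 0.6, 1.0])

for cutoff, ids in training_sets.items():
    removed = 1 - len(ids) / len(toy_similarity)
    print(f"cutoff={cutoff:.1f}: {len(ids)} complexes kept, {removed:.0%} removed")
```

Each SF would then be trained on every such set and evaluated on the same test set, so that its performance can be compared as progressively more similar proteins are allowed into training.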


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1cac/5871981/6487e2f4c73f/biomolecules-08-00012-g001.jpg
