Martin Juliette
Bases Moléculaires et Structurales des Systèmes Infectieux, CNRS, UMR 5086; Université Lyon 1, IBCP, 7 passage du Vercors F-69367, France.
Proteins. 2014 Jul;82(7):1444-52. doi: 10.1002/prot.24512. Epub 2014 Feb 12.
Many methods have been developed to predict protein-protein binding sites. Each new method is traditionally benchmarked on sets of protein structures of various sizes, and global statistics are used to assess prediction quality. Little attention has been paid to the bias that protein size may introduce into these statistics. Small proteins involve proportionally more residues at interfaces than large ones, so a predictive method that is biased toward small proteins can have its performance over-estimated. Here, we investigate the bias due to this size effect when benchmarking protein-protein interface prediction on the widely used docking benchmark 4.0. First, we simulate random scores that favor small proteins over large ones. Instead of the AUC (Area Under the Curve) of 0.5 expected by chance, these biased scores give an AUC of 0.6 using hypergeometric distributions, and up to 0.65 using constant scores. We then use real prediction results to illustrate how to detect the size bias by shuffling, and how to correct it with a simple conversion of the scores into normalized ranks. In addition, we investigate the scores produced by eight published methods and show that they are all affected by the size effect, which can change their relative ranking. The size effect also affects linear combination scores by modifying the relative contribution of each method. In the future, systematic corrections should be applied when benchmarking predictive methods on data sets with mixed protein sizes.
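The size effect described in the abstract can be reproduced with a toy simulation. The sketch below is not the paper's protocol: the protein sizes, the fixed number of interface residues per protein, the size-dependent score offset, and the helper names `pooled_auc`, `normalized_ranks`, and `simulate` are all assumptions made for illustration. Scores carry no information within a protein (interface residues are drawn at random, which plays the same role as shuffling), yet pooling residues across proteins yields an AUC above 0.5. Converting each protein's scores to normalized ranks, here taken as (rank - 0.5) / n, which is one plausible reading of "normalized ranks", brings the pooled AUC back to about 0.5.

```python
import random


def pooled_auc(scores, labels):
    """AUC via the Mann-Whitney rank-sum formula, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def normalized_ranks(scores):
    """Map one protein's scores to (rank - 0.5) / n, independent of protein size."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    out = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        out[idx] = (rank - 0.5) / n
    return out


def simulate(seed=0, n_interface=15):
    """Return (AUC on raw scores, AUC on normalized ranks) for a toy data set."""
    rng = random.Random(seed)
    # Assumed toy sizes: 50 to 500 residues in steps of 5, two proteins per size.
    sizes = [50 + 5 * i for i in range(91)] * 2
    raw, norm, labels = [], [], []
    for size in sizes:
        # Assumed bias: smaller proteins receive a larger constant offset.
        offset = 1.0 - size / 500.0
        scores = [rng.random() + offset for _ in range(size)]
        # Interface residues are chosen at random, so scores are uninformative.
        interface = set(rng.sample(range(size), n_interface))
        raw.extend(scores)
        norm.extend(normalized_ranks(scores))
        labels.extend(1 if i in interface else 0 for i in range(size))
    return pooled_auc(raw, labels), pooled_auc(norm, labels)


if __name__ == "__main__":
    auc_raw, auc_norm = simulate()
    print(f"pooled AUC, raw biased scores: {auc_raw:.3f}")
    print(f"pooled AUC, normalized ranks:  {auc_norm:.3f}")
```

Because a fixed interface size makes the interface fraction shrink as the protein grows, any score that is systematically higher for small proteins is rewarded in the pooled statistic. Ranking within each protein removes that between-protein component while leaving the within-protein ordering untouched.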