Urban Gregor, Torrisi Mirko, Magnan Christophe N, Pollastri Gianluca, Baldi Pierre
Department of Computer Science & Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA.
UCD Institute for Discovery, University College Dublin, Dublin, 4, Ireland.
Comput Struct Biotechnol J. 2020 Aug 27;18:2281-2289. doi: 10.1016/j.csbj.2020.08.015. eCollection 2020.
The use of evolutionary profiles to predict protein secondary structure, as well as other protein structural features, has been standard practice since the 1990s. Using profiles in the input of such predictors, in place of or in addition to the sequence itself, leads to significantly more accurate predictions. While profiles can enhance structural signals, their role remains somewhat surprising, as proteins do not use profiles when folding in vivo. Furthermore, the same sequence-based redundancy reduction protocols initially derived to train and evaluate sequence-based predictors have been applied to train and evaluate profile-based predictors. This can lead to unfair comparisons, since profiles may facilitate the bleeding of information between training and test sets. Here we use the extensively studied problem of secondary structure prediction to better evaluate the role of profiles and show that: (1) high levels of profile similarity between training and test proteins are observed when using standard sequence-based redundancy protocols; (2) the gain in accuracy for profile-based predictors, over sequence-based predictors, strongly relies on these high levels of profile similarity between training and test proteins; and (3) the overall accuracy of a profile-based predictor on a given protein dataset provides a biased measure when trying to estimate the actual accuracy of the predictor, or when comparing it to other predictors. We show, however, that this bias can be mitigated by implementing a new protocol (EVALpro) which evaluates the accuracy of profile-based predictors as a function of the profile similarity between training and test proteins. Such a protocol not only allows for a fair comparison of the predictors on equally hard or easy examples, but also reduces the impact of choosing a given similarity cutoff when selecting test proteins.
The EVALpro program is available in the SCRATCH suite (www.scratch.proteomics.ics.uci.edu) and can be downloaded at: www.download.igb.uci.edu/#evalpro.
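The idea of reporting accuracy as a function of train/test profile similarity, rather than as a single overall number, can be illustrated with a minimal Python sketch. This is not the EVALpro implementation: the `TestProtein` record, the `max_profile_similarity` score in [0, 1], and the bin edges are all hypothetical, and the abstract does not specify how EVALpro measures profile similarity or groups proteins.

```python
from bisect import bisect_right
from typing import Dict, List, NamedTuple, Optional, Tuple


class TestProtein(NamedTuple):
    """Hypothetical per-protein record; not an EVALpro data structure."""
    max_profile_similarity: float  # assumed score in [0, 1] vs. the training set
    correct_residues: int          # residues whose predicted class was correct
    total_residues: int            # residues evaluated for this protein


def accuracy_by_similarity(
    proteins: List[TestProtein],
    bin_edges: List[float],
) -> Dict[Tuple[float, float], Optional[float]]:
    """Per-residue accuracy for each similarity bin [low, high).

    Proteins at or above the top edge are counted in the last bin.
    Bins with no residues map to None.
    """
    n_bins = len(bin_edges) - 1
    correct = [0] * n_bins
    total = [0] * n_bins
    for p in proteins:
        # Index of the bin whose lower edge is the largest edge <= similarity.
        idx = min(bisect_right(bin_edges, p.max_profile_similarity) - 1, n_bins - 1)
        if idx < 0:
            continue  # similarity below the first edge: not counted
        correct[idx] += p.correct_residues
        total[idx] += p.total_residues
    return {
        (bin_edges[i], bin_edges[i + 1]): (correct[i] / total[i] if total[i] else None)
        for i in range(n_bins)
    }


# Illustrative numbers only, not results from the paper.
example = [
    TestProtein(0.1, 60, 100),
    TestProtein(0.2, 70, 100),
    TestProtein(0.9, 90, 100),
]
for (low, high), acc in accuracy_by_similarity(example, [0.0, 0.5, 1.0]).items():
    print(f"similarity [{low}, {high}): accuracy = {acc}")
```

Comparing two predictors bin by bin, instead of on the pooled test set, is what allows them to be judged on equally hard or easy examples and makes the comparison less sensitive to the similarity cutoff used when selecting test proteins.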