Salgado J Cristian, Rapaport Ivan, Asenjo Juan A
Centre for Biochemical Engineering and Biotechnology, Department of Chemical and Biotechnology Engineering, University of Chile, Beauchef 861, Santiago, Chile.
J Chromatogr A. 2006 Feb 24;1107(1-2):120-9. doi: 10.1016/j.chroma.2005.12.033. Epub 2005 Dec 27.
This paper focuses on the prediction of the dimensionless retention time (DRT) of proteins in hydrophobic interaction chromatography (HIC) by means of mathematical models based on a statistical description of the amino acid surface distribution. Previous models characterise the protein surface as a whole. However, most of the time it is not the whole protein but some of its specific regions that interact with the environment, so it seems more natural to use local measurements of the surface characteristics. Therefore, the distribution of an amino acid property on the protein surface was characterised statistically by systematically calculating the local average of this property in a neighbourhood placed sequentially on each of the amino acids on the protein surface. This process allowed us to characterise the distribution of the property quantitatively using three main statistics: average, standard deviation and maximum. In particular, if the property considered is a hydrophobicity scale, these statistics characterise the average hydrophobicity, the hydrophobic content of the most hydrophobic cluster or hotspot, and the heterogeneity of the hydrophobicity distribution on the protein surface. We tested the performance of DRT predictive models based on these statistics on a set of 15 proteins and obtained better predictive results than the models previously reported. The best predictive model was a linear model based on the maximum, with this statistic calculated using an index of the mobilities of amino acids in chromatography. The predictive performance of this model (measured as the Jack Knife MSE) was 26.9% better than that obtained by the best model that does not consider the amino acid distribution and 19.5% better than that of the model based on the hydrophobic imbalance (HI). Overall, the best performance was obtained by a linear multivariable model based on the HI and the maximum. The differences between the experimental data and the predictions of this model were smaller than those observed previously. In fact, this model showed better predictive capacity than a previous linear multivariable model, decreasing the Jack Knife MSE by 8.7%. In addition, this model allowed us to reduce the number of variables required, thereby increasing the degrees of freedom of the model.
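The two computational steps named in the abstract, the local-average statistics and the Jack Knife MSE, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the neighbourhood is defined or which standard deviation is used, so the sketch assumes a spherical neighbourhood of fixed radius around each surface residue and the population standard deviation. The function names `local_surface_statistics` and `jackknife_mse`, and the parameters `coords`, `values` and `radius`, are hypothetical.

```python
import numpy as np


def local_surface_statistics(coords, values, radius):
    """Return (average, standard deviation, maximum) of the local averages.

    coords : (n, 3) positions of the surface amino acids (assumed input)
    values : (n,) property value of each amino acid, e.g. a hydrophobicity scale
    radius : neighbourhood radius (assumption: spherical neighbourhood)
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Pairwise Euclidean distances between surface residues.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Neighbourhood centred on residue i: every residue within `radius`,
    # including residue i itself, so the count is never zero.
    inside = dist <= radius
    local_avg = (inside * values[None, :]).sum(axis=1) / inside.sum(axis=1)
    # Population standard deviation (ddof=0) is an assumption of this sketch.
    return local_avg.mean(), local_avg.std(), local_avg.max()


def jackknife_mse(x, y):
    """Leave-one-out mean squared error of a least-squares linear model."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # single predictor, e.g. the maximum statistic
    # Design matrix with an intercept column.
    design = np.hstack([np.ones((len(y), 1)), x])
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        # Fit on all proteins except protein i, then predict protein i.
        coef, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        errors.append((y[i] - design[i] @ coef) ** 2)
    return float(np.mean(errors))
```

In this sketch, a single-variable model based on the maximum would be evaluated by passing the per-protein maxima as `x` and the measured DRT values as `y`. A multivariable model, such as one combining the HI and the maximum, would pass a two-column array as `x`.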