Navea Susana, Tauler Romá, Goormaghtigh Erik, de Juan Anna
Chemometrics Group, Department of Analytical Chemistry, Universitat de Barcelona, Barcelona, Spain.
Proteins. 2006 May 15;63(3):527-41. doi: 10.1002/prot.20890.
Protein classification and characterization often rely on the information contained in the protein secondary structure. Protein class assignment is usually based on X-ray diffraction measurements, which require the protein in crystallized form, or on NMR spectra, which give the structure of the protein in solution. Simpler spectroscopic techniques, such as circular dichroism (CD) and infrared (IR) spectroscopy, are also known to be sensitive to protein secondary structure, but they have seldom been used for protein classification. To assess the potential of CD, IR, and combined CD/IR measurements for protein classification, two unsupervised pattern recognition methods, Principal Component Analysis (PCA) and cluster analysis, are first applied to check whether proteins group naturally according to their measured spectra. Partial Least Squares Discriminant Analysis (PLS-DA), a supervised pattern recognition method, is then used to test whether each protein class can be modeled explicitly and whether these models can assign unknown proteins to a class. Determination of the protein secondary structure, understood here as prediction of the abundance of the different secondary structure motifs in the biomolecule, was carried out with interval Partial Least Squares (iPLS), a local regression method. CD, IR, and CD/IR measurements were correlated with the fraction of the motif to be predicted, as determined from X-ray measurements. iPLS builds models from the spectral information most correlated with a specific secondary structure motif and avoids the use of irrelevant spectral regions. The spectral intervals chosen by the iPLS models provide structural information that can be used to confirm previous biochemical assignments or to identify new motif-related spectral features. The predictive ability of the models built with the selected spectral regions is similar to that of previous classical approaches.
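The interval selection idea behind iPLS, as described in the abstract above, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a PLS1 model with a single latent variable, leave-one-out cross-validation, equal-width intervals, and synthetic "spectra" in which only channels 10 to 19 depend on the motif fraction. The function names (`pls1_fit`, `pls1_predict`, `loo_rmse`, `ipls`) are hypothetical and chosen for illustration only.

```python
import math
import random


def pls1_fit(X, y):
    """Fit a one-latent-variable PLS1 model; returns (x_mean, y_mean, w, b)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximum covariance between X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w)) or 1.0
    w = [v / norm for v in w]
    # Scores, and the regression of y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    tt = sum(v * v for v in t) or 1.0
    b = sum(t[i] * yc[i] for i in range(n)) / tt
    return x_mean, y_mean, w, b


def pls1_predict(model, x):
    """Predict y for one spectrum x with a fitted one-component model."""
    x_mean, y_mean, w, b = model
    return y_mean + b * sum((x[j] - x_mean[j]) * w[j] for j in range(len(x)))


def loo_rmse(X, y):
    """Leave-one-out cross-validated root mean square error."""
    sq = 0.0
    for i in range(len(X)):
        model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        sq += (pls1_predict(model, X[i]) - y[i]) ** 2
    return math.sqrt(sq / len(X))


def ipls(X, y, n_intervals):
    """Fit one local model per spectral interval; return [((lo, hi), rmse)]."""
    p = len(X[0])
    width = p // n_intervals
    results = []
    for k in range(n_intervals):
        lo = k * width
        hi = p if k == n_intervals - 1 else lo + width
        sub = [row[lo:hi] for row in X]
        results.append(((lo, hi), loo_rmse(sub, y)))
    return results


# Synthetic data (an assumption, not real CD/IR spectra): 30 "proteins",
# 40 spectral channels, y = motif fraction in [0, 1]. Only channels 10-19
# carry signal; all channels contain Gaussian noise.
rng = random.Random(0)
y = [rng.random() for _ in range(30)]
X = [[(yi if 10 <= j < 20 else 0.0) + rng.gauss(0, 0.05) for j in range(40)]
     for yi in y]

results = ipls(X, y, n_intervals=4)
best = min(results, key=lambda r: r[1])
for (lo, hi), rmse in results:
    print(f"channels {lo:2d}-{hi - 1:2d}: LOO RMSE = {rmse:.3f}")
print("selected interval:", best[0])
```

In practice an iPLS analysis would typically tune the number of latent variables per interval and compare each local model against the full-spectrum model; both steps are omitted here to keep the sketch short.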