Use of variable selection in modeling the secondary structural content of proteins from their composition of amino acid residues.

Authors

Pilizota Teuta, Lucić Bono, Trinajstić Nenad

Affiliation

The Rugjer Bosković Institute, PO Box 180, HR-10002 Zagreb, Croatia.

Publication

J Chem Inf Comput Sci. 2004 Jan-Feb;44(1):113-21. doi: 10.1021/ci034037p.

Abstract

Predicting the secondary structure content of a protein from its amino acid composition can help bridge the gap for proteins whose primary sequence is known but whose secondary structure is not. Almost all recently published models relating the composition (frequency of occurrence) of amino acid residues to the secondary structure content of proteins use the composition of all 20 amino acid residues. However, it is well known that many amino acid residues are similar to one another in their physicochemical properties (hydrophobicity, hydrophilicity, charge, size, etc.). We were therefore motivated to investigate whether the total number of terms (frequencies of amino acid residues) can be reduced in models that relate amino acid composition to the percentage of residues belonging to alpha, beta, and coil secondary structure. For this purpose, we used the CROMRsel algorithm (J. Chem. Inf. Comput. Sci. 1999, 39, 121-132) to select a small subset of the most important variables/descriptors, i.e., frequencies of occurrence of amino acid residues in proteins, for the multiregression (MR) models. The analysis was performed on a data set of 475 proteins taken from Proteins 1996, 25, 157-168. The complete data set was partitioned into a 317-protein training set and a 158-protein test set. The best possible linear models containing I = 1, ..., 20 frequencies were selected from among all 20 frequencies of occurrence of amino acid residues on the 317-protein training set and were then used to predict the corresponding percentage of secondary structure content on the 158-protein test set. For the 317-protein training set, the best concise models selected for alpha, beta, and coil secondary structure contain only 9, 5, and 8 frequencies, respectively. The selected concise models have fitted, cross-validated, and predictive statistical parameters that are the same as or better than those of the models containing all 20 frequencies. Additionally, for each I (I = 1, ..., 20), the 30 best possible random models were selected. In each case, the best possible real model is much better than each of the best possible random models, which clearly shows that there is no risk of chance correlation (a risk one might expect from an exhaustive search for the best model with I frequencies among all 20!/(I!(20-I)!) possible models). Finally, the best models selected on the complete 475-protein data set for alpha, beta, and coil secondary structure contain only 7, 4, and 7 frequencies of amino acid residues, respectively. These models are much simpler and have lower fitted and cross-validated errors than the corresponding models from the literature, which were obtained without a procedure for selecting the most important frequencies of amino acid residues in proteins.
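
The selection procedure described in the abstract can be illustrated with a minimal sketch. It is a plain brute-force search over column subsets, not the CROMRsel implementation, and it runs on synthetic data rather than the 475-protein set. The function names (design, fit, best_subset), the Dirichlet-generated compositions, the coefficients of the synthetic target, the first-317-rows split, and the use of a permuted target as a stand-in for a "random model" are all assumptions made for illustration only.

```python
from itertools import combinations
from math import comb

import numpy as np


def design(X, cols):
    """Design matrix: an intercept column followed by the chosen frequency columns."""
    return np.column_stack([np.ones(X.shape[0]), X[:, list(cols)]])


def fit(X, y, cols):
    """Ordinary least-squares fit; returns the coefficients and the fitted RMS error."""
    A = design(X, cols)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, float(np.sqrt(np.mean((y - A @ coef) ** 2)))


def best_subset(X, y, size):
    """Try every subset of `size` columns and keep the one with the lowest fitted error."""
    scored = (
        (fit(X, y, cols)[1], cols)
        for cols in combinations(range(X.shape[1]), size)
    )
    err, cols = min(scored)
    return cols, err


# Synthetic stand-in for the data (assumption): 475 "proteins", 20 frequencies per row
# summing to 1, and a target that truly depends on only three of the frequencies.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(20), size=475)
y = 0.3 + 1.5 * X[:, 0] - 2.0 * X[:, 5] + 1.0 * X[:, 11] + rng.normal(0, 0.005, 475)

# 317/158 split as in the abstract; taking the first 317 rows is an assumption.
X_train, y_train = X[:317], y[:317]
X_test, y_test = X[317:], y[317:]

size = 3
n_models = comb(20, size)  # 20!/(I!(20-I)!) candidate models for I = size

# Select on the training set, then predict the held-out test set.
cols, fit_err = best_subset(X_train, y_train, size)
coef, _ = fit(X_train, y_train, cols)
test_err = float(np.sqrt(np.mean((y_test - design(X_test, cols) @ coef) ** 2)))

# Chance-correlation check (assumed form): repeat the search on a shuffled target.
y_random = rng.permutation(y_train)
_, random_err = best_subset(X_train, y_random, size)

print(f"{n_models} candidate models of size {size}")
print(f"selected columns: {cols}")
print(f"fit RMSE {fit_err:.4f}, test RMSE {test_err:.4f}, random RMSE {random_err:.4f}")
```

Summed over I = 1, ..., 20, the number of candidate models is 2^20 - 1 = 1,048,575, so an exhaustive search of this kind is feasible for 20 variables but grows quickly with more. Comparing the real fit against the best fit to a shuffled target is one simple way to see whether the search itself could produce a comparably good model by chance.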
