Protein secondary structure prediction with SPARROW.

Affiliation

Freie Universität Berlin, Institut für Chemie, Fabeckstr. 36a, D-14195 Berlin, Germany.

Publication information

J Chem Inf Model. 2012 Feb 27;52(2):545-56. doi: 10.1021/ci200321u. Epub 2012 Jan 23.

Abstract

A first step toward predicting the structure of a protein is to determine its secondary structure. The secondary structure information is generally used as a starting point to solve protein crystal structures. In the present study, a machine learning approach based on a complete set of two-class scoring functions was used. Such functions discriminate between two specific structural classes or between a single specific class and the rest. The approach uses a hierarchical scheme of scoring functions and a neural network. The parameters are determined by optimizing the recall of learning data. Quality control is performed by predicting separate independent test data. A first set of scoring functions is trained to correlate the secondary structures of residues with profiles of sequence windows of width 15, centered at these residues. The sequence profiles are obtained by multiple sequence alignment with PSI-BLAST. A second set of scoring functions is trained to correlate the secondary structures of the center residues with the secondary structures of all other residues in the sequence windows used in the first step. Finally, a neural network is trained using the results from the second set of scoring functions as input to make a decision on the secondary structure class of the residue in the center of the sequence window. Here, we consider the three-class problem of helix, strand, and other secondary structures. The corresponding prediction scheme "SPARROW" was trained with the ASTRAL40 database, which contains protein domain structures with less than 40% sequence identity. The secondary structures were determined with DSSP. In a loose assignment, the helix class contains all DSSP helix types (α, 3-10, π), the strand class contains β-strand and β-bridge, and the third class contains the other structures. In a tight assignment, the helix and strand classes contain only α-helix and β-strand classes, respectively.
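The loose and tight class assignments and the width-15 sequence windows described above can be illustrated with a minimal Python sketch. It is not taken from the SPARROW software: the function names `reduce_dssp` and `windows`, the label `C` for the third class, and the `-` padding at the chain termini are all assumptions made for illustration. Only the DSSP one-letter codes (H, G, I, E, B) are standard.

```python
# Standard DSSP one-letter codes: H = alpha-helix, G = 3-10 helix, I = pi-helix,
# E = beta-strand, B = beta-bridge. Any other code (T, S, blank, ...) is "other".
# The output labels H / E / C are an illustrative convention, not SPARROW's own.
LOOSE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}
TIGHT = {"H": "H", "E": "E"}


def reduce_dssp(dssp, scheme="loose"):
    """Map a string of DSSP codes to the three classes helix, strand, other."""
    if scheme == "loose":
        table = LOOSE
    elif scheme == "tight":
        table = TIGHT
    else:
        raise ValueError("scheme must be 'loose' or 'tight'")
    return "".join(table.get(code, "C") for code in dssp)


def windows(sequence, width=15, pad="-"):
    """Return one window of the given odd width centered on each residue.

    Padding the termini with a placeholder symbol is an assumption; the
    abstract does not say how windows at the chain ends are handled.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so that the window has a center")
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]
```

In this sketch each window would still have to be converted to a PSI-BLAST profile before it could serve as input to the first set of scoring functions; that step is omitted here.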
A 10-fold cross validation showed less than 0.8% deviation in the fraction of correct structure assignments between true prediction and recall of data used for training. Using sequences of 140,000 residues as a test data set, 80.46% ± 0.35% of secondary structures are predicted correctly in the loose assignment, a prediction performance that is very close to the best results in the field. Most applications are done with the loose assignment. However, the tight assignment yields 2.25% better prediction performance. With each individual prediction, we also provide a confidence measure that gives the probability that the prediction is correct. The SPARROW software can be used and downloaded at http://agknapp.chemie.fu-berlin.de/sparrow/ .
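The "fraction of correct structure assignments" quoted above is a per-residue accuracy over the three classes. A minimal sketch of that measure follows. The function name `fraction_correct` is hypothetical, and the abstract does not state how the ± 0.35% uncertainty was obtained, so that part is not reproduced.

```python
def fraction_correct(predicted, observed):
    """Fraction of residues whose predicted class equals the observed class.

    Both arguments are equal-length strings of per-residue class labels,
    for example 'H' (helix), 'E' (strand) and 'C' (other).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("at least one residue is required")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)
```

For example, `fraction_correct("HHHEC", "HHHCC")` compares five residues, four of which agree.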
