Robust prediction of B-factor profile from sequence using two-stage SVR based on random forest feature selection.

Authors

Pan Xiao-Yong, Shen Hong-Bin

Affiliation

Institute of Image Processing & Pattern Recognition, Shanghai Jiaotong University, 800 Dongchuan Road, Shanghai, China.

Publication information

Protein Pept Lett. 2009;16(12):1447-54. doi: 10.2174/092986609789839250.

DOI:10.2174/092986609789839250

PMID:20001907

Abstract

The B-factor, which measures the uncertainty in the position of an atom within a crystal structure, is highly correlated with protein internal motion. Although rapid progress in structural biology has made more accurate protein structures available than ever, the avalanche of new protein sequences in the post-genomic era keeps widening the gap between known protein sequences and known protein structures. Automated methods that predict the B-factor profile directly from the amino acid sequence are therefore urgently needed, so that such profiles can be used for basic research in a timely manner. In this article, we propose a novel approach, called PredBF, to predict the real value of the B-factor. We first extract both global and local features from the protein sequences as well as their evolutionary information. Random forest feature selection is then applied to rank the features by importance, and the most important features are fed into a two-stage support vector regression (SVR) model, in which the initial predictions from the first SVR are passed to the second-layer SVR for final refinement. Our results show that a systematic analysis of feature importance gives insight into the contributions of the different features and is necessary for developing effective B-factor prediction tools. The two-layer SVR prediction model designed in this study further enhanced the robustness of B-factor profile prediction. As a web server, PredBF is freely available at: http://www.csbio.sjtu.edu.cn/bioinf/PredBF for academic use.
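The pipeline described in the abstract (rank features with a random forest, keep the top ones, then refine the output of a first SVR with a second SVR) can be illustrated with a minimal sketch. This is not the PredBF implementation. It assumes scikit-learn, and every function name, the number of retained features, the sliding window over first-stage predictions, the use of out-of-fold predictions, and the RBF kernel are illustrative assumptions that the abstract does not specify.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVR


def select_top_features(X, y, n_keep, seed=0):
    """Rank features by random-forest importance; return indices of the top n_keep."""
    forest = RandomForestRegressor(n_estimators=100, random_state=seed)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    return order[:n_keep]


def window_features(values, half_width):
    """Build one row per position from its value and its neighbours (edges padded)."""
    padded = np.pad(values, half_width, mode="edge")
    width = 2 * half_width + 1
    return np.column_stack([padded[i:i + len(values)] for i in range(width)])


def fit_two_stage(X, y, n_keep=5, half_width=2):
    """Fit stage 1 on selected features and stage 2 on windows of stage-1 output.

    Rows of X are assumed to be residues in sequence order (an assumption of
    this sketch), so that a window over predictions covers neighbouring residues.
    """
    keep = select_top_features(X, y, n_keep)
    stage1 = SVR(kernel="rbf")
    # Out-of-fold predictions, so stage 2 is not trained on stage-1 outputs
    # for residues that stage 1 had already seen during fitting.
    out_of_fold = cross_val_predict(stage1, X[:, keep], y, cv=5)
    stage1.fit(X[:, keep], y)
    stage2 = SVR(kernel="rbf")
    stage2.fit(window_features(out_of_fold, half_width), y)
    return {"keep": keep, "stage1": stage1, "stage2": stage2,
            "half_width": half_width}


def predict_two_stage(model, X):
    """Predict with stage 1, then refine the predictions with stage 2."""
    first = model["stage1"].predict(X[:, model["keep"]])
    return model["stage2"].predict(window_features(first, model["half_width"]))


if __name__ == "__main__":
    # Synthetic stand-in data: 200 "residues" with 12 features each.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 12))
    y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=200)
    model = fit_two_stage(X, y, n_keep=4)
    print("selected feature indices:", model["keep"])
    print("first refined predictions:", predict_two_stage(model, X)[:3])
```

Feeding a window of neighbouring first-stage predictions to the second regressor is one common way to let the refinement step smooth the profile along the sequence; the abstract says only that the first-stage outputs are the input to the second layer.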


Similar articles

Robust prediction of B-factor profile from sequence using two-stage SVR based on random forest feature selection.

Protein Pept Lett. 2009;16(12):1447-54. doi: 10.2174/092986609789839250.

Predicting residue-wise contact orders in proteins by support vector regression.

BMC Bioinformatics. 2006 Oct 3;7:425. doi: 10.1186/1471-2105-7-425.

Disulfide Connectivity Prediction Based on Modelled Protein 3D Structural Information and Random Forest Regression.

IEEE/ACM Trans Comput Biol Bioinform. 2015 May-Jun;12(3):611-21. doi: 10.1109/TCBB.2014.2359451.

Large-scale prediction of human protein-protein interactions from amino acid sequence based on latent topic features.

J Proteome Res. 2010 Oct 1;9(10):4992-5001. doi: 10.1021/pr100618t.

Improving accuracy of protein contact prediction using balanced network deconvolution.

Proteins. 2015 Mar;83(3):485-96. doi: 10.1002/prot.24744. Epub 2015 Jan 24.

SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition.

BMC Bioinformatics. 2007 May 22;8 Suppl 4(Suppl 4):S2. doi: 10.1186/1471-2105-8-S4-S2.

Learning protein multi-view features in complex space.

Amino Acids. 2013 May;44(5):1365-79. doi: 10.1007/s00726-013-1472-6. Epub 2013 Feb 28.

PRBP: Prediction of RNA-Binding Proteins Using a Random Forest Algorithm Combined with an RNA-Binding Residue Predictor.

IEEE/ACM Trans Comput Biol Bioinform. 2015 Nov-Dec;12(6):1385-93. doi: 10.1109/TCBB.2015.2418773.

Prediction of protein B-factor profiles.

Proteins. 2005 Mar 1;58(4):905-12. doi: 10.1002/prot.20375.

Predicting disulfide connectivity from protein sequence using multiple sequence feature vectors and secondary structure.

Bioinformatics. 2007 Dec 1;23(23):3147-54. doi: 10.1093/bioinformatics/btm505. Epub 2007 Oct 17.

Cited by

Using graphlet degree vectors to predict atomic displacement parameters in protein structures.

Acta Crystallogr D Struct Biol. 2023 Dec 1;79(Pt 12):1109-1119. doi: 10.1107/S2059798323009142. Epub 2023 Nov 21.

The Diagnostic Features of Peripheral Blood Biomarkers in Identifying Osteoarthritis Individuals: Machine Learning Strategies and Clinical Evidence.

Curr Comput Aided Drug Des. 2024;20(6):928-942. doi: 10.2174/1573409920666230818092427.

Machine learning classification of plant genotypes grown under different light conditions through the integration of multi-scale time-series data.

Comput Struct Biotechnol J. 2023 May 23;21:3183-3195. doi: 10.1016/j.csbj.2023.05.005. eCollection 2023.

Identification of Type 2 Diabetes Biomarkers From Mixed Single-Cell Sequencing Data With Feature Selection Methods.

Front Bioeng Biotechnol. 2022 Jun 2;10:890901. doi: 10.3389/fbioe.2022.890901. eCollection 2022.

Weighted-persistent-homology-based machine learning for RNA flexibility analysis.

PLoS One. 2020 Aug 21;15(8):e0237747. doi: 10.1371/journal.pone.0237747. eCollection 2020.

Alternative Polyadenylation Modification Patterns Reveal Essential Posttranscription Regulatory Mechanisms of Tumorigenesis in Multiple Tumor Types.

Biomed Res Int. 2020 Jun 15;2020:6384120. doi: 10.1155/2020/6384120. eCollection 2020.

Copy Number Variation Pattern for Discriminating MACROD2 States of Colorectal Cancer Subtypes.

Front Bioeng Biotechnol. 2019 Dec 19;7:407. doi: 10.3389/fbioe.2019.00407. eCollection 2019.

Data-mining Techniques for Image-based Plant Phenotypic Traits Identification and Classification.

Sci Rep. 2019 Dec 20;9(1):19526. doi: 10.1038/s41598-019-55609-6.

Immunosignature Screening for Multiple Cancer Subtypes Based on Expression Rule.

Front Bioeng Biotechnol. 2019 Nov 29;7:370. doi: 10.3389/fbioe.2019.00370. eCollection 2019.

Primary Tumor Site Specificity is Preserved in Patient-Derived Tumor Xenograft Models.

Front Genet. 2019 Aug 13;10:738. doi: 10.3389/fgene.2019.00738. eCollection 2019.
