Nepal Reecha, Spencer Joanna, Bhogal Guneet, Nedunuri Amulya, Poelman Thomas, Kamath Thejas, Chung Edwin, Kantardjieff Katherine, Gottlieb Andrea, Lustig Brooke
Department of Chemistry, San Jose State University, San Jose, CA 95192-0101, USA.
Department of Mathematics and Statistics, San Jose State University, San Jose, CA 95192-0101, USA.
J Appl Crystallogr. 2015 Nov 10;48(Pt 6):1976-1984. doi: 10.1107/S1600576715018531. eCollection 2015 Dec 1.
A working example of relative solvent accessibility (RSA) prediction for proteins is presented. Novel logistic regression models with various qualitative descriptors that include amino acid type and quantitative descriptors that include 20- and six-term sequence entropy have been built and validated. A domain-complete learning set of over 1300 proteins is used to fit initial models with various sequence homology descriptors as well as query residue qualitative descriptors. Homology descriptors are derived from BLASTp sequence alignments, whereas the RSA values are determined directly from the crystal structure. The logistic regression models are fitted using dichotomous responses indicating whether a residue is buried or solvent accessible, with binary classifications obtained from the RSA values. The fitted models determine binary predictions of residue solvent accessibility with accuracies comparable to other less computationally intensive methods, using the standard RSA thresholds of 20 and 25% to classify a residue as solvent accessible. When an additional non-homology descriptor describing Lobanov-Galzitskaya residue disorder propensity is included, incremental improvements in accuracy are achieved, with 25% threshold accuracies of 76.12 and 74.79% for the Manesh-215 and CASP(8+9) test sets, respectively. Moreover, the described software and the accompanying learning and validation sets allow students and researchers to explore the utility of RSA prediction with simple, physically intuitive models in any number of related applications.
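The descriptors and model named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' software. The six-class amino acid grouping, the use of ">=" at the RSA threshold, the plain gradient-descent fit, and all function names are assumptions made for illustration, and the numbers in the usage note are toy values rather than data from the paper.

```python
import math
from collections import Counter

# Assumed six-class grouping of the 20 amino acids (illustrative only;
# the abstract does not state which grouping the six-term entropy uses).
SIX_GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}
AA_TO_GROUP = {aa: name for name, aas in SIX_GROUPS.items() for aa in aas}


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the symbol frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def entropy20(column):
    """20-term entropy of one alignment column (gaps and unknowns skipped)."""
    return shannon_entropy([aa for aa in column if aa in AA_TO_GROUP])


def entropy6(column):
    """Six-term entropy: residues are mapped to their group first."""
    return shannon_entropy([AA_TO_GROUP[aa] for aa in column if aa in AA_TO_GROUP])


def is_accessible(rsa_percent, threshold=25.0):
    """Dichotomous response: 1 = accessible, 0 = buried (">=" is an assumption)."""
    return 1 if rsa_percent >= threshold else 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent; w[0] is the intercept."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def predict(w, x):
    """Binary prediction: 1 if the fitted probability is at least 0.5."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if sigmoid(z) >= 0.5 else 0
```

As a usage illustration with toy values, `fit_logistic([[0.0], [0.2], [1.5], [2.0]], [0, 0, 1, 1])` fits a one-descriptor model in which higher sequence entropy is associated with the accessible class, and `predict` then returns the binary classification for a new entropy value. A fuller model would add the qualitative descriptors (for example, one indicator variable per amino acid type) as extra columns of `X`.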