Jahandideh Samad, Srinivasasainagendra Vinodh, Zhi Degui
Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA.
J Theor Biol. 2012 Nov 7;312:65-75. doi: 10.1016/j.jtbi.2012.07.013. Epub 2012 Aug 3.
RNA-protein interactions play an important role in various cellular processes, such as protein synthesis, gene regulation (including post-transcriptional regulation), alternative splicing, and infection by RNA viruses. In this study, we designed an automatic procedure that uses the Gene Ontology Annotation (GOA) and Structural Classification of Proteins (SCOP) databases to capture structurally solved RNA-binding protein domains in different subclasses. We then applied a tuned multi-class SVM (TMCSVM), Random Forest (RF), and multi-class ℓ1/ℓq-regularized logistic regression (MCRLR) to analyze and classify RNA-binding protein domains based on a comprehensive set of sequence and structural features, and compared the prediction accuracy of these three state-of-the-art methods. In our results, TMCSVM outperforms the other methods, which suggests its potential as a useful tool for multi-class prediction of RNA-binding protein domains. MCRLR, by elucidating how much individual features contribute to the predictive accuracy for RNA-binding protein domain subclasses, provides some biological insight into the roles of sequence and structure in protein-RNA interactions.
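The three-way classifier comparison described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn is available. Nothing here is taken from the paper itself: the data are synthetic (generated with `make_classification`), the hyperparameter grid and the names `build_models` and `compare` are hypothetical, and plain ℓ1-regularized logistic regression is swapped in for the ℓ1/ℓq group-regularized model, which scikit-learn does not provide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a domain-by-feature matrix (ASSUMPTION: not the
# paper's data; 4 classes play the role of RNA-binding domain subclasses).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)


def build_models():
    """Return the three classifier families named in the abstract."""
    # "Tuned" multi-class SVM: an inner grid search over C and gamma.
    # The grid values are illustrative, not the authors' settings.
    tuned_svm = GridSearchCV(
        make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]},
        cv=3,
    )
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    # Plain l1 penalty as a stand-in for l1/lq group regularization.
    l1_logreg = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=5000),
    )
    return {"tuned_svm": tuned_svm, "random_forest": forest, "l1_logreg": l1_logreg}


def compare(X, y, cv=5):
    """Mean cross-validated accuracy for each model."""
    return {
        name: cross_val_score(model, X, y, cv=cv).mean()
        for name, model in build_models().items()
    }


scores = compare(X, y)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

With a sparse penalty, features whose coefficients are driven to zero for every class can be read as uninformative, which is the general idea behind using a regularized logistic model for feature interpretation; the accuracies printed on synthetic data say nothing about the results reported in the study.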