Sharan Malvika, Förstner Konrad U, Eulalio Ana, Vogel Jörg
Institute of Molecular Infection Biology, University of Würzburg, 97080 Würzburg, Germany.
Core Unit Systems Medicine, University of Würzburg, 97080 Würzburg, Germany.
Nucleic Acids Res. 2017 Jun 20;45(11):e96. doi: 10.1093/nar/gkx137.
RNA-binding proteins (RBPs) have been established as core components of several post-transcriptional gene regulation mechanisms. Experimental techniques such as cross-linking and co-immunoprecipitation have enabled the large-scale identification of RBPs, RNA-binding domains (RBDs) and their regulatory roles in eukaryotic species such as human and yeast. In contrast, our knowledge of the number and potential diversity of RBPs in bacteria is more limited, owing to the technical challenges associated with existing global screening approaches. We introduce APRICOT, a computational pipeline for the sequence-based identification and characterization of proteins using RBDs known from experimental studies. The pipeline identifies functional motifs in protein sequences using position-specific scoring matrices and Hidden Markov Models of the functional domains, and statistically scores them based on a series of sequence-based features. Subsequently, APRICOT identifies putative RBPs and characterizes them by several biological properties. Here we demonstrate the application and adaptability of the pipeline on large-scale protein sets, including the bacterial proteome of Escherichia coli. Compared with other existing tools for the sequence-based prediction of RBPs, APRICOT showed better performance on various datasets, achieving an average sensitivity and specificity of 0.90 and 0.91, respectively. The command-line tool and its documentation are available at https://pypi.python.org/pypi/bio-apricot.
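To make the idea of motif detection with a position-specific scoring matrix more concrete, the following is a minimal Python sketch. It is not APRICOT's code or its scoring scheme: the four-residue motif alignment, the uniform background frequency, the pseudocount, and the function names `build_pssm` and `best_hit` are all assumptions chosen for illustration. The sketch builds log-odds scores per alignment column and slides the matrix along a query protein to find the best-scoring window.

```python
import math

# Hypothetical toy alignment of a 4-residue motif (not a real RBD).
ALIGNED_MOTIFS = ["RGGF", "RGGY", "KGGF", "RGGF"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
BACKGROUND = 1.0 / len(ALPHABET)   # assumed uniform background frequency
PSEUDOCOUNT = 1.0                  # assumed; avoids log(0) for unseen residues


def build_pssm(motifs):
    """Return one {residue: log-odds score} dict per alignment column."""
    pssm = []
    for column in zip(*motifs):
        total = len(column) + PSEUDOCOUNT * len(ALPHABET)
        scores = {}
        for residue in ALPHABET:
            freq = (column.count(residue) + PSEUDOCOUNT) / total
            scores[residue] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def best_hit(sequence, pssm):
    """Slide the PSSM along the sequence; return (best score, start index).

    Only the 20 standard residues are supported; any other character
    raises a KeyError in this sketch.
    """
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i][res] for i, res in enumerate(window))
        best = max(best, (score, start))
    return best


if __name__ == "__main__":
    pssm = build_pssm(ALIGNED_MOTIFS)
    score, start = best_hit("MAAARGGFLLK", pssm)
    print(f"best window starts at index {start} with score {score:.2f}")
```

A real pipeline would rely on curated domain models rather than a hand-built matrix, and would combine the raw hit with further sequence-based features before calling a protein a putative RBP; the sketch covers only the sliding-window scoring step.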