Center for Molecular Bioinformatics, Department of Biology, University of Rome Tor Vergata, Via della Ricerca Scientifica, Rome, Italy.
BMC Bioinformatics. 2009 Oct 24;10:351. doi: 10.1186/1471-2105-10-351.
Many proteins are highly modular, being assembled from globular domains and segments of natively disordered polypeptide. Linear motifs, short sequence modules that function independently of protein tertiary structure, are most abundant in natively disordered polypeptides but are also found in accessible parts of globular domains, such as exposed loops. Predicting novel occurrences of known linear motifs requires the difficult task of distinguishing functional matches from non-functional matches that occur by chance. Although functionality can only be confirmed experimentally, confidence in a putative motif increases if it exhibits attributes associated with functional instances, such as occurrence in the correct taxonomic range and cellular compartment, conservation in homologues, and accessibility to interacting partners. Several tools now use these attributes to classify putative motifs by confidence of functionality.
Current methods for assessing motif accessibility ignore much of the available information: they either predict accessibility from primary sequence or regard any motif occurring in a globular region as low confidence. To address this, we present a method that considers accessibility and secondary structure context derived from experimentally solved protein structures. Putatively functional motif occurrences are mapped onto a representative domain, provided that a high-quality reference SCOP domain structure is available for the protein itself or a close relative. Candidate motifs can then be scored for solvent accessibility and secondary structure context. The scores are calibrated on a benchmark set of experimentally verified motif instances compared with a set of random matches. A combined score yields 3-fold enrichment for functional motifs assigned to high confidence classifications and 2.5-fold enrichment for random motifs assigned to low confidence classifications. The structure filter is implemented as a pipeline that is accessible both through a graphical interface at the ELM resource (http://elm.eu.org/) and through a Web Service protocol.
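To make the enrichment figures concrete, the following is a minimal sketch of how a fold enrichment could be computed for a confidence class. It is an illustration only, not the pipeline's implementation: the `MotifMatch` record, the equal-weight `combined_score`, the 0.5 threshold, and the toy data are all assumptions, and enrichment is assumed here to mean the fraction of a match type within a class divided by its fraction in the whole benchmark.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MotifMatch:
    """Hypothetical record for one motif match mapped onto a domain structure."""
    is_functional: bool         # True: experimentally verified; False: random match
    accessibility: float        # assumed solvent-accessibility score in [0, 1]
    secondary_structure: float  # assumed secondary-structure context score in [0, 1]


def combined_score(match: MotifMatch, weight: float = 0.5) -> float:
    """Weighted mean of the two scores (the weighting is an assumption)."""
    return weight * match.accessibility + (1.0 - weight) * match.secondary_structure


def fold_enrichment(
    matches: Sequence[MotifMatch],
    in_class: Callable[[MotifMatch], bool],
    functional: bool,
) -> float:
    """Fraction of the chosen match type inside a class over its overall fraction."""
    selected = [m for m in matches if in_class(m)]
    overall = sum(m.is_functional == functional for m in matches)
    if not selected or overall == 0:
        return float("nan")  # enrichment is undefined for an empty class or type
    within_fraction = sum(m.is_functional == functional for m in selected) / len(selected)
    overall_fraction = overall / len(matches)
    return within_fraction / overall_fraction


# Toy benchmark (invented values): four verified instances and four random matches.
benchmark = [
    MotifMatch(True, 0.9, 0.8),
    MotifMatch(True, 0.7, 0.9),
    MotifMatch(True, 0.8, 0.6),
    MotifMatch(True, 0.2, 0.3),
    MotifMatch(False, 0.9, 0.7),
    MotifMatch(False, 0.1, 0.2),
    MotifMatch(False, 0.3, 0.1),
    MotifMatch(False, 0.4, 0.2),
]

THRESHOLD = 0.5  # hypothetical cut-off between high and low confidence

high_enrichment = fold_enrichment(
    benchmark, lambda m: combined_score(m) >= THRESHOLD, functional=True
)
low_enrichment = fold_enrichment(
    benchmark, lambda m: combined_score(m) < THRESHOLD, functional=False
)

print(f"functional matches in high-confidence class: {high_enrichment:.2f}-fold")
print(f"random matches in low-confidence class: {low_enrichment:.2f}-fold")
```

With balanced toy data the two values come out equal. On a real benchmark, where random matches typically far outnumber verified instances, the two enrichments would generally differ.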
New occurrences of known linear motifs require experimental validation because current bioinformatics tools have limited reliability. The ELM structure filter will aid users in assessing candidate motifs located in globular structural regions. Most importantly, it will help users decide whether to spend their valuable time and resources on experimentally testing interesting motif candidates.