Marsden Russell L, McGuffin Liam J, Jones David T
Bioinformatics Unit, Department of Computer Science, University College London, UK.
Protein Sci. 2002 Dec;11(12):2814-24. doi: 10.1110/ps.0209902.
Elucidating the domain content of a protein sequence in the absence of a determined structure or significant sequence homology to known domains is an important problem in structural biology. Here we address how successfully continuous domains can be delineated in the absence of sequence homology using simple baseline methods, an existing prediction algorithm (Domain Guess by Size), and a newly developed method (DomSSEA). The study was undertaken to measure the usefulness of these prediction methods for fully automatic domain assignment. Accordingly, the sensitivity of each domain assignment method was measured by counting the number of correctly assigned top-scoring predictions. We have implemented a new continuous domain identification method that aligns the predicted secondary structure of a target sequence against the observed secondary structures of chains with known domain boundaries as assigned by Class Architecture Topology Homology (CATH). Taking top predictions only, the method correctly assigns domain number for 73.3% of the representative chain set. The top prediction for both domain number and location of domain boundaries (within ±20 residues) was correct for 24% of the multidomain set. These results are put into context by comparison with the results obtained from the other prediction methods assessed.
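As an illustration of the scoring criterion described in the abstract, the following minimal Python sketch counts a top prediction as correct when it has the same number of domains as the reference assignment and every predicted boundary lies within 20 residues of the corresponding reference boundary. The function names, the representation of a chain as a list of boundary positions, and the in-order pairing of sorted boundaries are assumptions made for this example; the abstract does not specify how the authors matched boundaries.

```python
from typing import Iterable, Sequence, Tuple


def prediction_is_correct(
    predicted: Sequence[int],
    observed: Sequence[int],
    tolerance: int = 20,
) -> bool:
    """Return True if a top prediction matches the reference assignment.

    Each argument is a list of domain boundary positions (residue numbers).
    For continuous domains, a chain with n boundaries has n + 1 domains,
    so an empty list represents a single-domain chain.
    """
    # The domain number must match first (assumed: boundaries + 1 = domains).
    if len(predicted) != len(observed):
        return False
    # Assumed pairing: compare boundaries in sequence order.
    return all(
        abs(p - o) <= tolerance
        for p, o in zip(sorted(predicted), sorted(observed))
    )


def success_rate(
    pairs: Iterable[Tuple[Sequence[int], Sequence[int]]],
    tolerance: int = 20,
) -> float:
    """Fraction of chains whose top prediction is correct."""
    results = [prediction_is_correct(p, o, tolerance) for p, o in pairs]
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    toy_set = [
        ([105], [120]),            # boundary off by 15 residues
        ([50], [120]),             # boundary off by 70 residues
        ([90, 210], [100, 200]),   # both boundaries off by 10 residues
        ([100], [100, 200]),       # wrong domain number
    ]
    print(success_rate(toy_set))
```

Setting `tolerance` to a smaller value gives a stricter test of boundary placement, while comparing only `len(predicted)` with `len(observed)` corresponds to the looser domain-number criterion.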