Dumontier Michel, Yao Rong, Feldman Howard J, Hogue Christopher W V
Department of Biochemistry, University of Toronto, Toronto, Ont., Canada M5S 1A8.
J Mol Biol. 2005 Jul 29;350(5):1061-73. doi: 10.1016/j.jmb.2005.05.037.
The identification and annotation of protein domains is a critical step in the accurate determination of molecular function. Both computational and experimental methods of protein structure determination can be hindered by large multi-domain proteins or flexible linker regions. Knowledge of domains and their boundaries may reduce the experimental cost of protein structure determination by allowing researchers to work on a set of smaller and possibly more tractable alternatives. Current domain prediction methods often rely on sequence similarity to conserved domains and are therefore ill-suited to detecting domain structure in poorly conserved or orphan proteins. We present a simple computational method to identify protein domain linkers and their boundaries from sequence information alone. Our domain predictor, Armadillo (http://armadillo.blueprint.org), uses any amino acid index to convert a protein sequence into a smoothed numeric profile from which domains and domain boundaries may be predicted. We derived an amino acid index, the domain linker propensity index (DLI), from the amino acid composition of domain linkers in a non-redundant structure dataset. The index indicates that Pro and Gly have a propensity to occur in linkers, whereas small hydrophobic residues do not. Armadillo predicts domain linker boundaries from Z-score distributions and obtains 35% sensitivity with DLI on a two-domain, single-linker dataset (predictions within +/-20 residues of the linker). Combining DLI with an entropy-based amino acid index increases the overall Armadillo sensitivity to 56% for two-domain proteins. Moreover, Armadillo achieves 37% sensitivity for multi-domain proteins, surpassing most other prediction methods. Armadillo thus provides a simple but effective method for predicting domain boundaries with reasonable sensitivity. It should prove to be a valuable tool for rapidly delineating protein domains in poorly conserved proteins or those with no sequence neighbors. Used as a first-line predictor, Armadillo could also improve the results of domain meta-predictors.
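The profile-and-threshold idea described in the abstract can be illustrated with a minimal Python sketch. It is not the Armadillo implementation. The index values in `TOY_INDEX`, the window size, the Z-score cutoff `z_cut`, and all function names are hypothetical choices made for illustration, and the published DLI values are not reproduced here. The sensitivity function assumes one reading of "within +/-20 residues": a linker counts as found if any predicted position falls within 20 residues of the linker span.

```python
from statistics import mean, pstdev

# Hypothetical per-residue scores for illustration only (NOT the published DLI).
# Pro and Gly score high, a small hydrophobic residue scores low.
TOY_INDEX = {"P": 1.0, "G": 1.0, "A": -0.5}


def smoothed_profile(seq, index, window=5):
    """Map residues to index values, then apply a centered moving average."""
    values = [index.get(aa, 0.0) for aa in seq]  # unknown residues score 0.0
    half = window // 2
    return [
        mean(values[max(0, i - half):min(len(values), i + half + 1)])
        for i in range(len(values))
    ]


def z_scores(profile):
    """Standardize a profile against its own mean and standard deviation."""
    mu, sigma = mean(profile), pstdev(profile)
    if sigma == 0:
        return [0.0] * len(profile)
    return [(x - mu) / sigma for x in profile]


def candidate_linkers(seq, index, window=5, z_cut=1.5):
    """Return 0-based positions whose smoothed score has Z >= z_cut (assumed cutoff)."""
    z = z_scores(smoothed_profile(seq, index, window))
    return [i for i, score in enumerate(z) if score >= z_cut]


def sensitivity(predicted, true_linkers, tolerance=20):
    """Fraction of true linkers (start, end) with a prediction within `tolerance` residues."""
    if not true_linkers:
        return 0.0
    hits = sum(
        1
        for start, end in true_linkers
        if any(start - tolerance <= p <= end + tolerance for p in predicted)
    )
    return hits / len(true_linkers)


# Toy two-domain sequence: a Pro/Gly-rich stretch between two Ala blocks.
toy_seq = "A" * 30 + "PGPGPG" + "A" * 30
toy_candidates = candidate_linkers(toy_seq, TOY_INDEX)
toy_sensitivity = sensitivity(toy_candidates, [(30, 35)])
```

In this toy case the only positions that clear the cutoff are those of the Pro/Gly-rich stretch. Real sequences give much noisier profiles, which is consistent with the moderate sensitivities reported above.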