Nemoto Wataru, Toh Hiroyuki
Computational Biology Research Center (CBRC), Advanced Industrial Science and Technology (AIST), AIST Tokyo Waterfront Bio-IT Research Building, 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan.
BMC Struct Biol. 2012 May 29;12:11. doi: 10.1186/1472-6807-12-11.
Detecting clusters of conserved residues on a protein structure is an effective strategy for predicting functionally important regions. Various methods, such as Evolutionary Trace, are based on this strategy. In these approaches, conserved residues are identified by comparing homologous amino acid sequences, so the selection of homologous sequences is a critical step. It is known empirically that the set of homologous sequences must show a certain degree of sequence divergence for conserved residues to be identified. However, methods for selecting homologous sequences suited to this task have not been sufficiently developed. An objective and general selection method is needed for the efficient prediction of functional regions.
We have developed a novel index for selecting sequences appropriate for the identification of conserved residues, and implemented it within our method for predicting the functional regions of a protein. The index represents the degree to which conserved residues cluster on the tertiary structure of the protein; higher degrees of clustering yield larger index scores. To compute it, structure and sequence information were integrated by applying spatial statistics, a field of statistics that considers the geometrical coordinates of the data together with their attributes. We adopted the set of homologous sequences with the highest index score, under the assumption that prediction accuracy is best when the degree of clustering is at its maximum. The set of sequences selected by the index led to higher functional region prediction performance than the sets selected by other sequence-based methods.
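The idea of scoring how strongly conserved residues cluster in space, and then keeping the sequence set with the highest score, can be illustrated with a small sketch. The abstract does not say which spatial statistic the index is built on, so the example below swaps in Moran's I, a standard measure of spatial autocorrelation, purely as a stand-in. The function names, the binary distance weighting, and the 8 Å default cutoff are all assumptions made for illustration, not the authors' implementation.

```python
import math


def morans_i(coords, scores, cutoff=8.0):
    """Moran's I of per-residue conservation scores over 3D coordinates.

    Illustrative stand-in only: the statistic, the binary weights
    (w_ij = 1 when residues i and j lie within `cutoff`), and the
    cutoff value are assumptions, not taken from the paper.
    """
    n = len(scores)
    mean = sum(scores) / n
    dev = [s - mean for s in scores]
    denom = sum(d * d for d in dev)
    if denom == 0.0:
        return 0.0  # all scores identical: no spatial pattern to measure

    num = 0.0
    w_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= cutoff:
                num += dev[i] * dev[j]
                w_sum += 1.0
    if w_sum == 0.0:
        return 0.0  # no residue pair falls within the cutoff

    return (n / w_sum) * (num / denom)


def select_best_set(coords, candidate_scores, cutoff=8.0):
    """Return the name of the candidate sequence set whose conservation
    scores give the largest clustering index (hypothetical helper)."""
    return max(
        candidate_scores,
        key=lambda name: morans_i(coords, candidate_scores[name], cutoff),
    )
```

Here `coords` would hold one representative coordinate per residue (for example the C-alpha atom), and `candidate_scores` would map each candidate set of homologous sequences to the per-residue conservation scores computed from its alignment. Spatially grouped conserved residues give a positive value, while a scattered or alternating pattern gives a value near zero or negative.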
The index selects appropriate homologous sequences automatically and objectively, and this selection improved the performance of functional region prediction. As far as we know, this is the first approach to apply spatial statistics to protein analysis. Such integration of structure and sequence information may also be useful for other bioinformatics problems.