Shen Shiyi, Kai Bo, Ruan Jishou, Torin Huzil J, Carpenter Eric, Tuszynski Jack A
College of Mathematical Science and LPMC, Nankai University, Tianjin 300071, PR China.
Department of Oncology, Division of Experimental Oncology, Cross Cancer Institute, University of Alberta, 11560 University Avenue, Edmonton, Canada AB T6G 1Z2.
Physica A. 2006 Oct 15;370(2):651-662. doi: 10.1016/j.physa.2006.03.004. Epub 2006 Apr 3.
Here, we describe a unique probabilistic evaluation of the 20 naturally occurring amino acids and their distributions within the Swiss-Prot and Complete Human Genebank databases. We have developed a computational technique that imposes both directionality and length constraints on searches for unique combinations of amino acids within protein sequences. Using statistical approaches, we have searched for all possible two- and three-residue motifs contained within these databases. The technique relies on the unusually high occurrence of a small number of these motifs compared with the expected probability of finding a specific residue grouping within a given database. Subsequent filtering of the search results to identify such unique combinations has provided several examples that can be used as markers to identify particular proteins within or across databases. We focus on the three motifs of greatest interest to us: CC, CM, and their combination, CCM. Each occurs either more or less frequently than would be predicted from standard amino acid distributions within the entire human proteome.
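The comparison of observed and expected motif frequencies described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a simple independence model in which the expected count of an ordered motif is the number of sequence windows multiplied by the product of the single-residue frequencies. The function name `motif_enrichment` and the toy sequences are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import prod


def motif_enrichment(sequences, k):
    """Return {motif: (observed, expected, ratio)} for each ordered k-residue motif.

    Assumption (not from the source): the expected count uses an independence
    model, i.e. windows * product of single-residue frequencies.
    """
    residue_counts = Counter()
    motif_counts = Counter()
    windows = 0
    for seq in sequences:
        residue_counts.update(seq)
        # Slide a window of length k along the sequence; order matters,
        # so "CM" and "MC" are counted as different motifs.
        for i in range(len(seq) - k + 1):
            motif_counts[seq[i:i + k]] += 1
            windows += 1

    total = sum(residue_counts.values())
    freq = {aa: n / total for aa, n in residue_counts.items()}

    result = {}
    for motif, observed in motif_counts.items():
        expected = windows * prod(freq[aa] for aa in motif)
        result[motif] = (observed, expected, observed / expected)
    return result


if __name__ == "__main__":
    # Hypothetical toy input, not real database sequences.
    toy = ["CCMA", "ACCM"]
    for k in (2, 3):
        for motif, (obs, exp, ratio) in sorted(motif_enrichment(toy, k).items()):
            print(k, motif, obs, round(exp, 3), round(ratio, 3))
```

A ratio well above 1 marks a motif that occurs more often than the independence model predicts, and a ratio well below 1 marks one that occurs less often. On real databases a significance test would be needed before treating any motif as a marker.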