Gorodkin J, Lund O, Andersen C A, Brunak S
Department of Biotechnology, Technical University of Denmark, Lyngby, Denmark.
Proc Int Conf Intell Syst Mol Biol. 1999:95-105.
Correlations between sequence separation (in residues) and distance (in Ångström) of any pair of amino acids in polypeptide chains are investigated. For each sequence separation we define a distance threshold. For pairs of amino acids where the distance between Cα atoms is smaller than the threshold, a characteristic sequence (logo) motif is found. The motifs change as the sequence separation increases: for small separations they consist of one peak located in between the two residues, then additional peaks appear at these residues, and finally the center peak smears out for very large separations. We also find correlations between the residues in the center of the motif. These and other statistical analyses are used to design neural networks with enhanced performance compared to earlier work. Importantly, the statistical analysis explains why neural networks perform better than simple statistical data-driven approaches such as pair probability density functions. The statistical results also explain characteristics of the network performance for increasing sequence separation. The improvement from the new network design is significant in the sequence separation range of 10-30 residues. Finally, we find that the performance curve for increasing sequence separation is directly correlated with the corresponding information content. A WWW server, distanceP, is available at http://www.cbs.dtu.dk/services/distanceP/.
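The selection step and the logo statistic described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `close_pairs` and `column_information`, the toy coordinates, and the 5.0 Å threshold are all assumptions made for illustration, since the abstract gives no threshold values. The information content here is the simple form log2(20) minus the Shannon entropy per column, assuming a uniform background and no small-sample correction; the paper's exact logo definition may differ.

```python
import math
from collections import Counter


def close_pairs(ca_coords, separation, threshold):
    """Return pairs (i, i + separation) whose C-alpha distance is below threshold.

    ca_coords: list of (x, y, z) C-alpha coordinates in Angstrom, one per residue.
    separation: sequence separation in residues.
    threshold: distance cutoff in Angstrom (hypothetical value, chosen per separation).
    """
    pairs = []
    for i in range(len(ca_coords) - separation):
        j = i + separation
        if math.dist(ca_coords[i], ca_coords[j]) < threshold:
            pairs.append((i, j))
    return pairs


def column_information(windows, alphabet_size=20):
    """Information content in bits for each column of equal-length sequence windows.

    Assumes a uniform background over alphabet_size symbols.
    """
    info = []
    for column in zip(*windows):
        total = len(column)
        counts = Counter(column)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        info.append(math.log2(alphabet_size) - entropy)
    return info


# Toy chain of five residues (made-up coordinates, not real structure data).
coords = [(0, 0, 0), (3.8, 0, 0), (3.8, 3.8, 0), (0, 3.8, 0), (0, 7.6, 0)]
pairs = close_pairs(coords, separation=3, threshold=5.0)

# Toy sequence windows that would be collected around the selected pairs.
info = column_information(["AAG", "AAG", "ACG", "ACG"])
```

In this form, a fully conserved column reaches the maximum of log2(20) bits, and a column split evenly between two residue types carries one bit less. Collecting such windows separately for each sequence separation is one way to reproduce the kind of separation-dependent motif the abstract describes.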