van Heel M
Fritz Haber Institute of the Max Planck Society, Berlin Dahlem, Germany.
J Mol Biol. 1991 Aug 20;220(4):877-87. doi: 10.1016/0022-2836(91)90360-i.
A novel multivariate statistical approach is presented for extracting and exploiting intrinsic information present in our ever-growing sequence data banks. The information extraction from the sequences avoids the pitfalls of intersequence alignment by analyzing secondary invariant functions derived from the sequences in the data bank rather than the sequences themselves. A typical invariant function of this kind is a 20 x 20 histogram of occurrences of amino acid pairs in a given sequence or fragment thereof. To illustrate the potential of the approach, an analysis of 10,000 protein sequences from the National Biomedical Research Foundation Protein Identification Resource is presented, which already reveals great biological detail. For example, zeta-hemoglobin is found to lie close to amphibian and fish chi-hemoglobin which, in turn, is an important clue to the physiological function of this mammalian early embryonic hemoglobin. The multivariate statistical framework presented unifies such apparently unrelated issues as phylogenetic comparisons between a set of sequences and distance matrices between the constituents of the biological sequences. The Multivariate Statistical Sequence Analysis (MSSA) principles can be used for a wide spectrum of sequence analysis problems, such as assignment of family memberships to new sequences, validation of new incoming sequences to be entered into the database, prediction of structure from sequence, discrimination of coding from non-coding DNA regions, and automatic generation of an atlas of protein or DNA sequences. The MSSA techniques represent a self-contained approach to learning continuously and automatically from the growing stream of new sequences. The MSSA approach is particularly likely to play a significant role in major sequencing efforts such as the Human Genome Project.
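The 20 x 20 pair histogram mentioned in the abstract can be illustrated with a short sketch. This is not the author's implementation. The abstract does not say how pairs are formed or how histograms are compared, so the code assumes pairs of residues separated by a fixed offset (`gap`, adjacent by default) and uses a plain Euclidean distance between length-normalized histograms as a stand-in for whatever metric MSSA actually uses. The names `pair_histogram`, `feature_vector`, and `histogram_distance` are hypothetical.

```python
import math

# The 20 standard amino acids in one-letter code; other symbols are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def pair_histogram(sequence, gap=1):
    """Count ordered residue pairs (a, b) where b lies `gap` positions after a.

    Assumption: `gap` >= 1, with gap=1 meaning adjacent residues.
    Returns a 20 x 20 nested list of integer counts.
    """
    hist = [[0] * 20 for _ in range(20)]
    seq = sequence.upper()
    for a, b in zip(seq, seq[gap:]):
        if a in INDEX and b in INDEX:
            hist[INDEX[a]][INDEX[b]] += 1
    return hist


def feature_vector(sequence):
    """Flatten the histogram to 400 values and normalize them to sum to 1.

    Normalizing makes sequences of different lengths comparable
    without aligning them.
    """
    flat = [count for row in pair_histogram(sequence) for count in row]
    total = sum(flat)
    return [count / total for count in flat] if total else [0.0] * 400


def histogram_distance(seq_a, seq_b):
    """Euclidean distance between two normalized pair histograms
    (an illustrative metric, not necessarily the one used in MSSA)."""
    va, vb = feature_vector(seq_a), feature_vector(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
```

Because every sequence maps to a vector of the same fixed length, any two sequences can be compared directly, which is the sense in which such invariant functions sidestep intersequence alignment.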