Farnum Michael A, Xu Huafeng, Agrafiotis Dimitris K
3-Dimensional Pharmaceuticals Inc., 665 Stockton Drive, Exton, PA 19341, USA.
Protein Sci. 2003 Aug;12(8):1604-12. doi: 10.1110/ps.0379403.
The explosion of biological data resulting from genomic and proteomic research has created a pressing need for data analysis techniques that work effectively on a large scale. An area of particular interest is the organization and visualization of large families of protein sequences. An increasingly popular approach is to embed the sequences into a low-dimensional Euclidean space in a way that preserves some predefined measure of sequence similarity. This method has been shown to produce maps that exhibit global order and continuity and reveal important evolutionary, structural, and functional relationships between the embedded proteins. However, protein sequences are related by evolutionary pathways that exhibit highly nonlinear geometry, which is invisible to classical embedding procedures such as multidimensional scaling (MDS) and nonlinear mapping (NLM). Here, we describe the use of stochastic proximity embedding (SPE) for producing Euclidean maps that preserve the intrinsic dimensionality and metric structure of the data. SPE extends previous approaches in two important ways: (1) It preserves only local relationships between closely related sequences, thus allowing the map to unfold and reveal its intrinsic dimension, and (2) it scales linearly with the number of sequences and therefore can be applied to very large protein families. The merits of the algorithm are illustrated using examples from the protein kinase and nuclear hormone receptor superfamilies.
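The abstract does not spell out the update rule, so the following is only a minimal sketch of the pairwise-refinement idea usually associated with SPE, not the authors' implementation. It assumes a random pair of points is picked at each step and their map distance is nudged toward the input dissimilarity, but only when the pair is within a neighborhood cutoff or is currently too close on the map. The names `spe_embed`, `dist`, and `cutoff`, the linear learning-rate schedule, and all default values are illustrative assumptions.

```python
import math
import random


def spe_embed(dist, n_points, dim=2, cutoff=1.0, n_cycles=100,
              steps_per_cycle=None, lr_start=1.0, lr_end=0.01,
              eps=1e-9, seed=0):
    """Embed n_points items into `dim` dimensions by stochastic pairwise refinement.

    dist(i, j) is a caller-supplied (hypothetical) dissimilarity between
    items i and j, e.g. one derived from a sequence alignment score.
    """
    rng = random.Random(seed)
    # Start from random coordinates in the unit hypercube.
    coords = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    if steps_per_cycle is None:
        # Work per cycle grows linearly with the number of items.
        steps_per_cycle = 10 * n_points
    lr = lr_start
    lr_step = (lr_start - lr_end) / max(n_cycles - 1, 1)
    for _ in range(n_cycles):
        for _ in range(steps_per_cycle):
            i, j = rng.sample(range(n_points), 2)
            r = dist(i, j)                         # target dissimilarity
            d = math.dist(coords[i], coords[j])    # current map distance
            # Adjust local pairs always; adjust distant pairs only when
            # they sit closer on the map than their dissimilarity allows.
            if r <= cutoff or d < r:
                scale = lr * 0.5 * (r - d) / (d + eps)
                for k in range(dim):
                    delta = scale * (coords[i][k] - coords[j][k])
                    coords[i][k] += delta
                    coords[j][k] -= delta
        lr -= lr_step  # anneal the learning rate linearly
    return coords


# Usage: three items that are all one unit apart settle into a triangle.
points = spe_embed(lambda i, j: 1.0, n_points=3, cutoff=2.0)
```

Because each step touches a single pair and the number of steps per cycle is proportional to the number of items, the cost of this sketch grows linearly with the size of the family, which is the scaling property the abstract highlights. Distant pairs that are already far enough apart on the map are left alone, which is what lets the map unfold.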