Lu Fan, Keles Sündüz, Wright Stephen J, Wahba Grace
Department of Statistics, University of Wisconsin, Madison, WI 53706, USA.
Proc Natl Acad Sci U S A. 2005 Aug 30;102(35):12332-7. doi: 10.1073/pnas.0505411102. Epub 2005 Aug 18.
We develop and apply a previously undescribed framework that is designed to extract information in the form of a positive definite kernel matrix from possibly crude, noisy, incomplete, inconsistent dissimilarity information between pairs of objects, obtainable in a variety of contexts. Any positive definite kernel defines a consistent set of distances, and the fitted kernel provides a set of coordinates in Euclidean space that attempts to respect the information available while controlling for complexity of the kernel. The resulting set of coordinates is highly appropriate for visualization and as input to classification and clustering algorithms. The framework is formulated in terms of a class of optimization problems that can be solved efficiently by using modern convex cone programming software. The power of the method is illustrated in the context of protein clustering based on primary sequence data. An application to the globin family of proteins resulted in a readily visualizable 3D sequence space of globins, where several subfamilies and subgroupings consistent with the literature were easily identifiable.
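As an illustration of the statement that a positive definite kernel defines a consistent set of distances and a set of Euclidean coordinates, the following minimal sketch uses the standard identity d_ij^2 = K_ii + K_jj - 2 K_ij and an eigendecomposition of the kernel. It does not reproduce the convex cone programming step that fits the kernel to dissimilarity data; the function names, the toy Gram matrix, and the choice of NumPy are assumptions made for illustration only.

```python
import numpy as np


def kernel_to_sq_distances(K):
    """Squared distances induced by a kernel: d_ij^2 = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K


def kernel_to_coordinates(K, dim=3):
    """Coordinates from the leading `dim` eigenpairs of a symmetric PSD kernel."""
    eigvals, eigvecs = np.linalg.eigh(K)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dim]       # indices of the largest eigenvalues
    top_vals = np.clip(eigvals[order], 0.0, None) # guard against tiny negative round-off
    return eigvecs[:, order] * np.sqrt(top_vals)  # one row of coordinates per object


# Toy kernel (hypothetical data): Gram matrix of 6 random points in 3D,
# standing in for a kernel that would be fitted from dissimilarities.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))
K = points @ points.T

D2 = kernel_to_sq_distances(K)
X = kernel_to_coordinates(K, dim=3)

# Pairwise squared distances between the recovered coordinates.
recovered = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
```

Because the toy kernel has rank at most 3, the three-dimensional coordinates reproduce the kernel-induced distances up to round-off. For a higher-rank kernel, truncating to three dimensions would only approximate them.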