Subset clustering of binary sequences, with an application to genomic abnormality data.

Author information

Hoff Peter D

Affiliation

Department of Statistics and Biostatistics, University of Washington, Seattle, 98195-4322, USA.

Publication information

Biometrics. 2005 Dec;61(4):1027-36. doi: 10.1111/j.1541-0420.2005.00381.x.

PMID:16401276

Abstract

This article develops a model-based approach to clustering multivariate binary data, in which the attributes that distinguish a cluster from the rest of the population may depend on the cluster being considered. The clustering approach is based on a multivariate Dirichlet process mixture model, which allows for the estimation of the number of clusters, the cluster memberships, and the cluster-specific parameters in a unified way. Such a clustering approach has applications in the analysis of genomic abnormality data, in which the development of different types of tumors may depend on the presence of certain abnormalities at subsets of locations along the genome. Additionally, such a mixture model provides a nonparametric estimation scheme for dependent sequences of binary data.
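The generative idea behind such a model can be illustrated with a small simulation. The sketch below is not the article's model or its estimation procedure, neither of which is specified in the abstract. It assumes a Chinese restaurant process as the prior on cluster memberships (a standard representation of a Dirichlet process mixture) and assumes that each cluster departs from a shared baseline abnormality rate only on its own randomly chosen subset of positions. All function names, parameters, and numeric rates are hypothetical.

```python
import random


def crp_assignments(n, alpha, rng):
    """Draw cluster labels for n sequences from a Chinese restaurant process.

    alpha > 0 is the concentration parameter; larger values favor more clusters.
    """
    labels, counts = [], []
    for _ in range(n):
        # An existing cluster k has weight counts[k]; a new cluster has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return labels


def simulate(n, m, alpha, base_rate=0.1, subset_prob=0.3, seed=0):
    """Simulate n binary sequences of length m with cluster-specific subsets.

    Assumption for illustration: every position has abnormality rate base_rate,
    except that each cluster raises the rate on its own subset of positions.
    """
    rng = random.Random(seed)
    labels = crp_assignments(n, alpha, rng)
    n_clusters = max(labels) + 1

    subsets, rates = [], []
    for _ in range(n_clusters):
        # Positions on which this cluster differs from the rest of the population.
        subset = [j for j in range(m) if rng.random() < subset_prob]
        cluster_rates = [base_rate] * m
        for j in subset:
            cluster_rates[j] = rng.uniform(0.5, 0.9)
        subsets.append(subset)
        rates.append(cluster_rates)

    # Each sequence is a vector of independent Bernoulli draws given its cluster.
    data = [[int(rng.random() < rates[k][j]) for j in range(m)] for k in labels]
    return labels, subsets, data


labels, subsets, data = simulate(n=50, m=20, alpha=1.0)
print("clusters:", max(labels) + 1)
```

In this sketch the number of clusters is not fixed in advance but emerges from the prior, which is what lets a mixture of this kind estimate the number of clusters together with memberships and cluster-specific parameters.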
