van Uitert Miranda, Meuleman Wouter, Wessels Lodewyk
Bioinformatics and Statistics, Division of Molecular Biology, The Netherlands Cancer Institute, Amsterdam, The Netherlands.
J Comput Biol. 2008 Dec;15(10):1329-45. doi: 10.1089/cmb.2008.0066.
Genomic datasets often consist of large, binary, sparse data matrices. In such a dataset, one is often interested in finding contiguous blocks that (mostly) contain ones. This is a biclustering problem, and while many algorithms have been proposed to deal with gene expression data, only two algorithms have been proposed that specifically deal with binary matrices. None of the gene expression biclustering algorithms can handle the large number of zeros in sparse binary matrices. The two proposed binary algorithms failed to produce meaningful results. In this article, we present a new algorithm that is able to extract biclusters from sparse, binary datasets. A powerful feature is that biclusters with different numbers of rows and columns can be detected, ranging from many rows and few columns to few rows and many columns. The algorithm also allows the user to guide the search towards biclusters of specific dimensions. When applying our algorithm to an input matrix derived from TRANSFAC, we find transcription factors that have distinctly dissimilar binding motifs but share a clear set of common targets that are significantly enriched for GO categories.