Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America.
PLoS One. 2013 Aug 5;8(8):e71680. doi: 10.1371/journal.pone.0071680. Print 2013.
Biclustering has emerged as an important approach to the analysis of large-scale datasets. A biclustering technique identifies a subset of rows that exhibit similar patterns across a subset of columns in a data matrix. Many biclustering methods have been proposed, and most, if not all, of these algorithms were developed to detect regions of "coherence" patterns. These methods perform unsatisfactorily when the purpose is to identify biclusters of a constant level. This paper presents a two-step biclustering method to identify constant-level biclusters in binary or quantitative data. The algorithm identifies the maximal-dimensional submatrix such that the proportion of non-signals is less than a pre-specified tolerance δ. In the analysis of two synthetic datasets, the proposed method had much higher sensitivity and slightly lower specificity than several prominent biclustering methods. It was further compared with the Bimax method on two real datasets. The proposed method was shown to be the most robust in terms of sensitivity, number of biclusters, and number of serotype-specific biclusters identified. However, dichotomization using different signal-level thresholds usually leads to different sets of biclusters; this also occurs in the present analysis.