Department of Computer Engineering, Kadir Has University, Cibali, Istanbul 34083, Turkey.
Bioinformatics. 2010 Oct 15;26(20):2594-600. doi: 10.1093/bioinformatics/btq473. Epub 2010 Aug 23.
Biclustering gene expression data is the problem of extracting submatrices of genes and conditions exhibiting significant correlation across both the rows and the columns of a data matrix of expression values. Even the simplest versions of the problem are computationally hard. Most of the proposed solutions therefore employ greedy iterative heuristics that locally optimize a suitably assigned scoring function.
We provide a fast and simple pre-processing algorithm called localization that reorders the rows and columns of the input data matrix in such a way as to group correlated entries in small local neighborhoods within the matrix. The proposed localization algorithm takes its roots from effective use of graph-theoretical methods applied to problems exhibiting a similar structure to that of biclustering. In order to evaluate the effectivenesss of the localization pre-processing algorithm, we focus on three representative greedy iterative heuristic methods. We show how the localization pre-processing can be incorporated into each representative algorithm to improve biclustering performance. Furthermore, we propose a simple biclustering algorithm, Random Extraction After Localization (REAL) that randomly extracts submatrices from the localization pre-processed data matrix, eliminates those with low similarity scores, and provides the rest as correlated structures representing biclusters.
We compare the proposed localization pre-processing with another pre-processing alternative, non-negative matrix factorization. We show that our fast and simple localization procedure provides similar or even better results than the computationally heavy matrix factorization pre-processing with regards to H-value tests. We next demonstrate that the performances of the three representative greedy iterative heuristic methods improve with localization pre-processing when biological correlations in the form of functional enrichment and PPI verification constitute the main performance criteria. The fact that the random extraction method based on localization REAL performs better than the representative greedy heuristic methods under same criteria also confirms the effectiveness of the suggested pre-processing method.
Supplementary material including code implementations in LEDA C++ library, experimental data, and the results are available at http://code.google.com/p/biclustering/
cesim@khas.edu.tr; melihsozdinler@boun.edu.tr
Supplementary data are available at Bioinformatics online.
双聚类基因表达数据是从表达值数据矩阵的行和列中提取具有显著相关性的子矩阵的问题。即使是最简单的问题版本在计算上也是困难的。因此,大多数提出的解决方案都采用贪婪迭代启发式方法,这些方法在适当分配的评分函数上进行局部优化。
我们提供了一种快速而简单的预处理算法,称为定位,它对输入数据矩阵的行和列进行重新排序,以便在矩阵的小局部邻域中对相关条目进行分组。所提出的定位算法源自于对具有与双聚类相似结构的问题应用图论方法的有效使用。为了评估本地化预处理算法的有效性,我们专注于三种有代表性的贪婪迭代启发式方法。我们展示了如何将本地化预处理纳入每个代表性算法中,以提高双聚类性能。此外,我们提出了一种简单的双聚类算法,即定位后随机提取(REAL),该算法从本地化预处理的数据矩阵中随机提取子矩阵,消除那些相似度得分较低的子矩阵,并将其余子矩阵作为表示双聚类的相关结构提供。
我们将所提出的本地化预处理与另一种预处理替代方案非负矩阵分解进行了比较。我们表明,我们的快速而简单的定位过程在 H 值测试方面提供了与计算密集型矩阵分解预处理相似甚至更好的结果。接下来,我们证明了当以功能富集和 PPI 验证形式的生物学相关性构成主要性能标准时,三种有代表性的贪婪迭代启发式方法的性能通过本地化预处理得到了提高。基于定位的 REAL 的随机提取方法在相同标准下表现优于代表性的贪婪启发式方法的事实也证实了所提出的预处理方法的有效性。
补充材料包括在 LEDA C++库中的代码实现、实验数据和结果可在 http://code.google.com/p/biclustering/ 上获得。
cesim@khas.edu.tr;melihsozdinler@boun.edu.tr
补充数据可在 Bioinformatics 在线获得。