School of Engineering, Pablo de Olavide University, Seville, Spain.
Bioinformatics. 2011 Oct 1;27(19):2738-45. doi: 10.1093/bioinformatics/btr464. Epub 2011 Aug 8.
Binary datasets are a compact and simple way to store data about the relationships between a group of objects and their possible properties. In the last few years, several biclustering algorithms have been developed specifically for binary datasets. Approaches based on matrix factorization, suffix trees or divide-and-conquer techniques have been proposed to extract useful biclusters from binary data; these approaches provide information about the distribution of patterns and intrinsic correlations.
A novel approach to extracting biclusters from binary datasets, BiBit, is introduced here. Experiments with synthetic data show that BiBit performs very well and is robust to the density and size of the input data. BiBit is also applied to a central nervous system embryonic tumor gene expression dataset to assess the quality of its results. A novel gene expression preprocessing methodology, based on expression level layers, combined with the selective search performed by BiBit, which relies on a very fast bit-pattern processing technique, provides very satisfactory results in terms of both quality and computational cost. The power of biclustering in finding genes involved simultaneously in different cancer processes is also shown. Finally, a comparison with Bimax, one of the most cited binary biclustering algorithms, shows that BiBit is faster while providing essentially the same results.
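The abstract does not spell out the bit-pattern processing step, so the following is only a minimal sketch of how a pairwise bit-pattern search over binary rows might look, not the authors' implementation. It assumes that each row is packed into an integer, that candidate patterns come from the bitwise AND of two rows, and that a bicluster must meet minimum row and column counts. The names `encode_rows`, `bit_pattern_biclusters`, `min_rows` and `min_cols` are hypothetical and chosen for illustration.

```python
from itertools import combinations


def encode_rows(matrix):
    """Pack each 0/1 row into a Python int (bit j is set when column j is 1)."""
    return [sum(bit << j for j, bit in enumerate(row)) for row in matrix]


def bit_pattern_biclusters(matrix, min_rows=2, min_cols=2):
    """Illustrative pairwise bit-pattern search (assumed scheme, not BiBit itself).

    Returns a list of (row_indices, column_indices) tuples in which every
    listed row has a 1 in every listed column.
    """
    rows = encode_rows(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    seen = set()      # patterns already expanded, to avoid duplicate biclusters
    biclusters = []

    for i, j in combinations(range(len(rows)), 2):
        # Columns shared by rows i and j form the candidate pattern.
        pattern = rows[i] & rows[j]
        if pattern in seen or bin(pattern).count("1") < min_cols:
            continue
        seen.add(pattern)

        # A row belongs to the bicluster if it contains the whole pattern.
        members = [r for r, bits in enumerate(rows) if (bits & pattern) == pattern]
        if len(members) >= min_rows:
            cols = [c for c in range(n_cols) if (pattern >> c) & 1]
            biclusters.append((members, cols))

    return biclusters


if __name__ == "__main__":
    data = [
        [1, 1, 0, 1],
        [1, 1, 0, 0],
        [0, 1, 1, 1],
        [1, 1, 0, 1],
    ]
    for row_ids, col_ids in bit_pattern_biclusters(data):
        print("rows", row_ids, "columns", col_ids)
```

Packing rows into integers lets a single AND compare all columns of two rows at once, which is the kind of operation that makes bit-level processing fast; a production version would likely use fixed-width machine words rather than arbitrary-precision integers.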
The source and binary codes, the datasets used in the experiments and the results can be found at: http://www.upo.es/eps/bigs/BiBit.html
Supplementary data are available at Bioinformatics online.