Computer Science and Engineering Department, Government College of Engineering and Leather Technology, Kolkata, India.
Faculty of Mathematics and Information Sciences, Warsaw University of Technology, Warsaw, Poland.
BMC Bioinformatics. 2023 Nov 16;24(1):435. doi: 10.1186/s12859-023-05534-3.
Biclustering of biologically meaningful binary information is essential in many applications related to drug discovery, such as protein-protein interactions and gene expression. However, for robust performance on recently emerging large health datasets, new biclustering algorithms must be scalable and fast. We present a rapid unsupervised biclustering (RUBic) algorithm that achieves this objective with a novel encoding and search strategy. RUBic significantly reduces computational overhead on both synthetic and experimental datasets relative to several state-of-the-art biclustering algorithms. On 100 synthetic binary datasets, our method took [Formula: see text] s to extract 494,872 biclusters. On the human PPI database of size [Formula: see text], our method generates 1840 biclusters in [Formula: see text] s. On a central nervous system embryonic tumor gene expression dataset of size 712,940, our algorithm takes 101 min to produce 747,069 biclusters, while recent competing algorithms take significantly more time to produce the same result. RUBic is also evaluated on five different gene expression datasets and shows a significant speed-up in execution time over existing approaches when extracting significant KEGG-enriched biclusters. RUBic can operate in two modes, base and flex: base mode generates maximal biclusters, while flex mode runs faster and generates fewer biclusters, selected by their biological significance with respect to KEGG pathways. The code is available at https://github.com/CMATERJU-BIOINFO/RUBic for academic use only.
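To make the notion of a maximal bicluster in binary data concrete, the following is a minimal brute-force sketch in Python. It is not the RUBic encoding or search strategy, which the abstract does not describe; the function name `maximal_biclusters`, the `min_rows` and `min_cols` thresholds, and the toy matrix are assumptions made purely for illustration. A bicluster here is a set of rows and a set of columns whose submatrix is all ones, and it is maximal when no further row or column can be added without introducing a zero.

```python
from itertools import combinations


def maximal_biclusters(matrix, min_rows=2, min_cols=2):
    """Enumerate maximal all-ones biclusters of a small binary matrix.

    Illustrative brute force only (exponential in the number of rows);
    this is NOT the RUBic algorithm.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    # Encode each row as an integer bitmask: bit j is set when matrix[i][j] == 1.
    row_masks = [sum(1 << j for j, v in enumerate(row) if v) for row in matrix]
    found = set()
    for size in range(min_rows, n_rows + 1):
        for rows in combinations(range(n_rows), size):
            # Columns that are 1 in every selected row.
            shared = (1 << n_cols) - 1
            for i in rows:
                shared &= row_masks[i]
            if bin(shared).count("1") < min_cols:
                continue
            # Closure: all rows that contain every shared column.
            closure = tuple(
                i for i in range(n_rows) if (row_masks[i] & shared) == shared
            )
            # The pair is maximal exactly when the row set equals its closure.
            if closure == rows:
                cols = tuple(j for j in range(n_cols) if (shared >> j) & 1)
                found.add((rows, cols))
    return sorted(found)


# Hypothetical toy data: rows could be genes, columns could be conditions.
toy = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
]

for rows, cols in maximal_biclusters(toy):
    print("rows", rows, "cols", cols)
```

Exhaustive enumeration like this is only feasible for a handful of rows, which is why scalable encodings and search strategies matter for datasets of the sizes reported above.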