Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA.
College of Computer Science and Technology, Jilin University, Changchun 130012, China.
Bioinformatics. 2020 Feb 15;36(4):1143-1149. doi: 10.1093/bioinformatics/btz692.
Biclustering of large-scale gene expression data holds promise for detecting condition-specific functional gene modules (i.e. biclusters). However, existing methods do not comprehensively detect all significant bicluster structures and have limited power when applied to expression data generated by RNA sequencing (RNA-Seq), especially single-cell RNA-Seq (scRNA-Seq) data, which contain large numbers of zero and low expression values.
We present a new biclustering algorithm, QUalitative BIClustering algorithm Version 2 (QUBIC2), which is built on three components: (i) a novel left-truncated mixture of Gaussians model for accurate assessment of multimodality in zero-enriched expression data, (ii) a fast and efficient dropouts-saving expansion strategy that optimizes functional gene modules using information divergence and (iii) a rigorous statistical test for the significance of all identified biclusters in any organism, including organisms without substantial functional annotations. QUBIC2 performed considerably better at detecting biclusters than five other widely used algorithms on various benchmark datasets from E. coli, human and simulated data. QUBIC2 also showed robust and superior performance on gene expression data generated by microarray, bulk RNA-Seq and scRNA-Seq.
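To make component (i) concrete, the following is a minimal Python sketch of a log-likelihood for a Gaussian mixture fitted to zero-enriched data. It is not the QUBIC2 implementation. It assumes that values at or below a detection cutoff are treated as left-censored, so each contributes the mixture probability mass below the cutoff instead of a density value. The function names, the `cutoff` parameter and the censoring treatment are all illustrative assumptions.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Cumulative probability P(X <= x) for X ~ N(mean, sd^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def left_censored_mixture_loglik(values, cutoff, weights, means, sds):
    """Log-likelihood of a Gaussian mixture with left-censored observations.

    Hypothetical helper, not part of QUBIC2. Values at or below `cutoff`
    (for example zeros caused by dropout) are assumed to be censored and
    contribute the mixture probability mass below the cutoff; all other
    values contribute the ordinary mixture density.
    """
    components = list(zip(weights, means, sds))
    total = 0.0
    for v in values:
        if v <= cutoff:
            p = sum(w * normal_cdf(cutoff, m, s) for w, m, s in components)
        else:
            p = sum(w * normal_pdf(v, m, s) for w, m, s in components)
        # Floor the probability to avoid log(0) for extreme values.
        total += math.log(max(p, 1e-300))
    return total


if __name__ == "__main__":
    # Toy log-scale expression values: zeros stand in for dropouts.
    expression = [0.0, 0.0, 0.0, 1.2, 1.5, 4.8, 5.1, 5.3]
    ll = left_censored_mixture_loglik(
        expression, cutoff=0.0,
        weights=[0.5, 0.5], means=[1.0, 5.0], sds=[1.0, 0.5],
    )
    print(ll)
```

In a sketch like this, the number of mixture components could be chosen by comparing such log-likelihoods under a penalty for model size, which is one common way to assess multimodality. The criterion QUBIC2 actually uses is not stated in this abstract.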
The source code of QUBIC2 is freely available at https://github.com/OSU-BMBL/QUBIC2.
Supplementary data are available at Bioinformatics online.