A comparative analysis of biclustering algorithms for gene expression data.

Affiliation

Department of Computer Science and Engineering, The Ohio State University, 3165 Graves Hall, 333 West 10th Avenue, Columbus, OH 43210, USA.

Publication information

Brief Bioinform. 2013 May;14(3):279-92. doi: 10.1093/bib/bbs032. Epub 2012 Jul 6.

DOI: 10.1093/bib/bbs032

PMID: 22772837

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3659300/

Abstract

The need to analyze high-dimension biological data is driving the development of new data mining methods. Biclustering algorithms have been successfully applied to gene expression data to discover local patterns, in which a subset of genes exhibit similar expression levels over a subset of conditions. However, it is not clear which algorithms are best suited for this task. Many algorithms have been published in the past decade, most of which have been compared only to a small number of algorithms. Surveys and comparisons exist in the literature, but because of the large number and variety of biclustering algorithms, they are quickly outdated. In this article we partially address this problem of evaluating the strengths and weaknesses of existing biclustering methods. We used the BiBench package to compare 12 algorithms, many of which were recently published or have not been extensively studied. The algorithms were tested on a suite of synthetic data sets to measure their performance on data with varying conditions, such as different bicluster models, varying noise, varying numbers of biclusters and overlapping biclusters. The algorithms were also tested on eight large gene expression data sets obtained from the Gene Expression Omnibus. Gene Ontology enrichment analysis was performed on the resulting biclusters, and the best enrichment terms are reported. Our analyses show that the biclustering method and its parameters should be selected based on the desired model, whether that model allows overlapping biclusters, and its robustness to noise. In addition, we observe that the biclustering algorithms capable of finding more than one model are more successful at capturing biologically relevant clusters.
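
As a rough illustration of the synthetic-data idea described in the abstract, the sketch below plants one constant, upshifted bicluster in a Gaussian noise matrix and scores a candidate bicluster against it with a cell-level Jaccard index. This is a minimal sketch under stated assumptions, not the BiBench implementation: the function names `make_planted_bicluster` and `jaccard_cells`, the default sizes, the shift value, and the choice of Jaccard over matrix cells are all hypothetical and may differ from the models and scores used in the article.

```python
import numpy as np


def make_planted_bicluster(n_genes=200, n_conds=50, bic_genes=20, bic_conds=10,
                           shift=3.0, noise_sd=1.0, seed=0):
    """Build a noise matrix with one planted constant bicluster.

    All sizes and the shift are illustrative assumptions, not values
    taken from the article.
    """
    rng = np.random.default_rng(seed)
    # Background: independent Gaussian noise for every gene/condition cell.
    data = rng.normal(0.0, noise_sd, size=(n_genes, n_conds))
    # Pick the gene (row) and condition (column) subsets of the bicluster.
    genes = rng.choice(n_genes, size=bic_genes, replace=False)
    conds = rng.choice(n_conds, size=bic_conds, replace=False)
    # Raise expression only on the selected submatrix.
    data[np.ix_(genes, conds)] += shift
    return data, set(genes.tolist()), set(conds.tolist())


def jaccard_cells(genes_a, conds_a, genes_b, conds_b):
    """Jaccard index of two biclusters, counted over matrix cells."""
    inter = len(genes_a & genes_b) * len(conds_a & conds_b)
    area_a = len(genes_a) * len(conds_a)
    area_b = len(genes_b) * len(conds_b)
    union = area_a + area_b - inter
    return inter / union if union else 0.0


if __name__ == "__main__":
    data, true_genes, true_conds = make_planted_bicluster()
    # A hypothetical "found" bicluster that misses five of the true genes.
    found_genes = set(sorted(true_genes)[5:])
    print(jaccard_cells(true_genes, true_conds, found_genes, true_conds))
```

Raising `noise_sd`, planting several biclusters, or letting their row and column sets intersect would give rough analogues of the noise, number, and overlap conditions the abstract mentions.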
