Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data.

Author information

Tseng George C

Affiliation

Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA.

Publication information

Bioinformatics. 2007 Sep 1;23(17):2247-55. doi: 10.1093/bioinformatics/btm320. Epub 2007 Jun 27.

DOI: 10.1093/bioinformatics/btm320
PMID: 17597097

Abstract

MOTIVATION

Cluster analysis is one of the most important data mining tools for investigating high-throughput biological data. The existence of many scattered objects that should not be clustered has been found to hinder performance of most traditional clustering algorithms in such a high-dimensional complex situation. Very often, additional prior knowledge from databases or previous experiments is also available in the analysis. Excluding scattered objects and incorporating existing prior information are desirable to enhance the clustering performance.

RESULTS

In this article, a class of loss functions is proposed for cluster analysis and applied in high-throughput genomic and proteomic data. Two major extensions from K-means are involved: penalization and weighting. The additive penalty term is used to allow a set of scattered objects without being clustered. Weights are introduced to account for prior information of preferred or prohibited cluster patterns to be identified. Their relationship with the classification likelihood of Gaussian mixture models is explored. Incorporation of good prior information is also shown to improve the global optimization issue in clustering. Applications of the proposed method on simulated data as well as high-throughput data sets from tandem mass spectrometry (MS/MS) and microarray experiments are presented. Our results demonstrate its superior performance over most existing methods and its computational simplicity and extensibility in the application of large complex biological data sets.
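
The abstract does not give the loss function itself, so the following is only a minimal sketch of the general idea, not the author's implementation. It assumes squared Euclidean distance, one positive weight per object standing in for prior information, and a Lloyd-style alternation in which an object is left unclustered (label -1) whenever its weighted distance to the nearest center is at least the penalty lam. The function name, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def penalized_weighted_kmeans(X, init_centers, lam, weights=None, n_iter=100):
    """Illustrative sketch: K-means with a per-object penalty and weights.

    Assumed loss (not taken from the abstract):
        sum over clustered objects of w_i * ||x_i - c_j||^2
        + lam * (number of scattered objects)
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    w = np.ones(len(X)) if weights is None else np.asarray(weights, dtype=float)
    labels = np.full(len(X), -1)
    rows = np.arange(len(X))

    for it in range(n_iter):
        # Squared distance from every object to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        cost = w * d2[rows, nearest]
        # Joining a cluster costs `cost`; staying scattered costs `lam`.
        new_labels = np.where(cost < lam, nearest, -1)
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # The weighted mean minimizes the weighted squared distance per cluster.
        for j in range(len(centers)):
            members = labels == j
            if members.any():
                centers[j] = np.average(X[members], axis=0, weights=w[members])

    clustered = labels >= 0
    diffs = X[clustered] - centers[labels[clustered]]
    loss = (w[clustered] * (diffs ** 2).sum(axis=1)).sum() + lam * (~clustered).sum()
    return labels, centers, float(loss)

# Toy data: two tight groups of four points and one far-away object.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11],
              [5, 30]], dtype=float)
labels, centers, loss = penalized_weighted_kmeans(
    X, init_centers=[[0, 0], [10, 10]], lam=4.0)
print(labels, loss)
```

In this toy run the far-away object is left unclustered rather than pulling a center toward it, which is the behavior the penalty term is meant to allow. Raising a single object's weight makes it more expensive to cluster, so under this assumed rule it becomes more likely to be treated as scattered.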

AVAILABILITY

http://www.pitt.edu/~ctseng/research/software.html.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

1
Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data.
Bioinformatics. 2007 Sep 1;23(17):2247-55. doi: 10.1093/bioinformatics/btm320. Epub 2007 Jun 27.
2
Clustering microarray gene expression data using weighted Chinese restaurant process.
Bioinformatics. 2006 Aug 15;22(16):1988-97. doi: 10.1093/bioinformatics/btl284. Epub 2006 Jun 9.
3
Towards clustering of incomplete microarray data without the use of imputation.
Bioinformatics. 2007 Jan 1;23(1):107-13. doi: 10.1093/bioinformatics/btl555. Epub 2006 Oct 31.
4
Automated variable weighting in k-means type clustering.
IEEE Trans Pattern Anal Mach Intell. 2005 May;27(5):657-68. doi: 10.1109/TPAMI.2005.95.
5
Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach.
Bioinformatics. 2007 Jul 1;23(13):1607-15. doi: 10.1093/bioinformatics/btm158. Epub 2007 May 5.
6
A roadmap of clustering algorithms: finding a match for a biomedical application.
Brief Bioinform. 2009 May;10(3):297-314. doi: 10.1093/bib/bbn058. Epub 2009 Feb 24.
7
Scalable model-based clustering for large databases based on data summarization.
IEEE Trans Pattern Anal Mach Intell. 2005 Nov;27(11):1710-9. doi: 10.1109/TPAMI.2005.226.
8
Clustering of change patterns using Fourier coefficients.
Bioinformatics. 2008 Jan 15;24(2):184-91. doi: 10.1093/bioinformatics/btm568. Epub 2007 Nov 19.
9
A hypergraph-based learning algorithm for classifying gene expression and arrayCGH data with prior knowledge.
Bioinformatics. 2009 Nov 1;25(21):2831-8. doi: 10.1093/bioinformatics/btp467. Epub 2009 Jul 30.
10
Inferring pairwise regulatory relationships from multiple time series datasets.
Bioinformatics. 2007 Mar 15;23(6):755-63. doi: 10.1093/bioinformatics/btl676. Epub 2007 Jan 19.

Cited by

1
Astrocyte-derived dominance winning reverses chronic stress-induced depressive behaviors.
Mol Brain. 2024 Aug 27;17(1):59. doi: 10.1186/s13041-024-01134-1.
2
Simultaneous clustering and variable selection: A novel algorithm and model selection procedure.
Behav Res Methods. 2023 Aug;55(5):2157-2174. doi: 10.3758/s13428-022-01795-7. Epub 2022 Sep 9.
3
Network-based cancer heterogeneity analysis incorporating multi-view of prior information.
Bioinformatics. 2022 May 13;38(10):2855-2862. doi: 10.1093/bioinformatics/btac183.
4
A sparse negative binomial mixture model for clustering RNA-seq count data.
Biostatistics. 2022 Dec 12;24(1):68-84. doi: 10.1093/biostatistics/kxab025.
5
Identification of glioblastoma-specific prognostic biomarkers via an integrative analysis of DNA methylation and gene expression.
Oncol Lett. 2020 Aug;20(2):1619-1628. doi: 10.3892/ol.2020.11729. Epub 2020 Jun 11.
6
Comparative Pathway Integrator: A Framework of Meta-Analytic Integration of Multiple Transcriptomic Studies for Consensual and Differential Pathway Analysis.
Genes (Basel). 2020 Jun 24;11(6):696. doi: 10.3390/genes11060696.
7
Optimally adjusted last cluster for prediction based on balancing the bias and variance by bootstrapping.
PLoS One. 2019 Nov 4;14(11):e0223529. doi: 10.1371/journal.pone.0223529. eCollection 2019.
8
Object Weighting: A New Clustering Approach to Deal with Outliers and Cluster Overlap in Computational Biology.
IEEE/ACM Trans Comput Biol Bioinform. 2021 Mar-Apr;18(2):633-643. doi: 10.1109/TCBB.2019.2921577. Epub 2021 Apr 8.
9
Tight clustering for large datasets with an application to gene expression data.
Sci Rep. 2019 Feb 28;9(1):3053. doi: 10.1038/s41598-019-39459-w.
10
A Survey of Data Mining and Deep Learning in Bioinformatics.
J Med Syst. 2018 Jun 28;42(8):139. doi: 10.1007/s10916-018-1003-9.