Clustering of samples and variables with mixed-type data.

Authors

Hummel Manuela, Edelmann Dominic, Kopp-Schneider Annette

Affiliation

Division of Biostatistics, German Cancer Research Center, Heidelberg, Germany.

Publication information

PLoS One. 2017 Nov 28;12(11):e0188274. doi: 10.1371/journal.pone.0188274. eCollection 2017.

DOI:10.1371/journal.pone.0188274
PMID:29182671
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5705083/
Abstract

Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. 
We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.
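
One of the two new approaches named in the abstract derives variable dissimilarities from distance correlation. The following is a minimal sketch of that idea in plain Python, not the CluMix implementation. The per-variable metrics (absolute difference for quantitative variables, 0/1 mismatch for nominal ones), the conversion `1 - dCor`, the toy data, and all function names are assumptions made for illustration.

```python
import math


def pairwise_distances(values, kind):
    """Return the n x n distance matrix of one variable.

    kind="numeric": absolute difference.
    kind="nominal": 0 if the categories match, 1 otherwise.
    Both metric choices are assumptions of this sketch.
    """
    n = len(values)
    if kind == "numeric":
        return [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]
    if kind == "nominal":
        return [[0.0 if values[i] == values[j] else 1.0 for j in range(n)]
                for i in range(n)]
    raise ValueError(f"unknown variable kind: {kind!r}")


def double_center(d):
    """Subtract row and column means, then add back the grand mean."""
    n = len(d)
    row_means = [sum(row) / n for row in d]  # equal to column means (d is symmetric)
    grand_mean = sum(row_means) / n
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def _mean_product(a, b):
    """Average of the elementwise product of two n x n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, kind_x, y, kind_y):
    """Sample distance correlation between two variables of possibly different type."""
    if len(x) != len(y):
        raise ValueError("both variables must be measured on the same samples")
    a = double_center(pairwise_distances(x, kind_x))
    b = double_center(pairwise_distances(y, kind_y))
    dcov2 = _mean_product(a, b)
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0.0:
        return 0.0  # a constant variable shows no association
    # Guard against tiny floating-point excursions outside [0, 1].
    return min(1.0, math.sqrt(max(dcov2, 0.0) / denom))


def variable_dissimilarities(data, kinds):
    """Map each ordered pair of variable names to 1 - distance correlation."""
    names = list(data)
    return {
        (u, v): 1.0 - distance_correlation(data[u], kinds[u], data[v], kinds[v])
        for u in names for v in names
    }


if __name__ == "__main__":
    # Invented toy data: two quantitative variables and one nominal variable.
    data = {
        "expression": [1.2, 0.8, 1.1, 3.9, 4.2, 4.0],
        "dose": [0.5, 0.4, 0.6, 2.1, 2.0, 2.2],
        "mutation": ["wt", "wt", "wt", "mut", "mut", "mut"],
    }
    kinds = {"expression": "numeric", "dose": "numeric", "mutation": "nominal"}
    for pair, value in sorted(variable_dissimilarities(data, kinds).items()):
        print(pair, round(value, 3))
```

A dissimilarity matrix of this kind could then be passed to any standard hierarchical clustering routine to order the variables of a heatmap, which is the advantage over ClustOfVar that the abstract points out.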

Figures

Fig 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d517/5705083/ef80da2d5067/pone.0188274.g001.jpg
Fig 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d517/5705083/fb56ceb12684/pone.0188274.g002.jpg
Fig 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d517/5705083/d527de7c3637/pone.0188274.g003.jpg
Fig 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d517/5705083/1790e78f198a/pone.0188274.g004.jpg
Fig 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d517/5705083/a0f5a8ff1c82/pone.0188274.g005.jpg
Fig 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d517/5705083/77530fc04e03/pone.0188274.g006.jpg
Fig 7: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d517/5705083/922e2769d6e3/pone.0188274.g007.jpg
