

Bayesian clustering with uncertain data.

Affiliations

Cambridge Institute of Therapeutic Immunology and Infectious Disease, University of Cambridge, Cambridge, United Kingdom.

MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom.

Publication information

PLoS Comput Biol. 2024 Sep 3;20(9):e1012301. doi: 10.1371/journal.pcbi.1012301. eCollection 2024 Sep.

Abstract

Clustering is widely used in bioinformatics and many other fields, with applications from exploratory analysis to prediction. Many types of data have associated uncertainty or measurement error, but this is rarely used to inform the clustering. We present Dirichlet Process Mixtures with Uncertainty (DPMUnc), an extension of a Bayesian nonparametric clustering algorithm which makes use of the uncertainty associated with data points. We show that DPMUnc out-performs existing methods on simulated data. We cluster immune-mediated diseases (IMD) using GWAS summary statistics, which have uncertainty linked with the sample size of the study. DPMUnc separates autoimmune from autoinflammatory diseases and isolates other subgroups such as adult-onset arthritis. We additionally consider how DPMUnc can be used to cluster gene expression datasets that have been summarised using gene signatures. We first introduce a novel procedure for generating a summary of a gene signature on a dataset different to the one where it was discovered, which incorporates a measure of the variability in expression across signature genes within each individual. We summarise three public gene expression datasets containing patients with a range of IMD, using three relevant gene signatures. We find association between disease and the clusters returned by DPMUnc, with clustering structure replicated across the datasets. The significance of this work is two-fold. Firstly, we demonstrate that when data has associated uncertainty, this uncertainty should be used to inform clustering and we present a method which does this, DPMUnc. Secondly, we present a procedure for using gene signatures in datasets other than where they were originally defined. We show the value of this procedure by summarising gene expression data from patients with immune-mediated diseases using relevant gene signatures, and clustering these patients using DPMUnc.
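The abstract describes summarising a gene signature per individual together with a measure of how variable expression is across the signature genes. A minimal sketch of one way such a summary could look is given below. It assumes the score is the mean over signature genes and the uncertainty is the squared standard error of that mean; the function name `summarise_signature` and both of these choices are illustrative assumptions, not the authors' published procedure.

```python
import numpy as np

def summarise_signature(expr, signature_idx):
    """Summarise a gene signature for each individual (hypothetical helper).

    expr: array of shape (n_individuals, n_genes), assumed already standardised.
    signature_idx: column indices of the genes in the signature.
    Returns (score, variance), each of shape (n_individuals,).
    """
    sub = np.asarray(expr, dtype=float)[:, signature_idx]
    n_genes = sub.shape[1]
    # Point estimate: mean expression across the signature genes.
    score = sub.mean(axis=1)
    # Uncertainty (assumption): squared standard error of that mean.
    variance = sub.var(axis=1, ddof=1) / n_genes
    return score, variance

# Toy data: 5 individuals, 20 genes, a 4-gene signature.
rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 20))
score, variance = summarise_signature(expr, [0, 3, 7, 11])
```

Each individual then carries a point estimate and an associated variance, which is the kind of input (observation plus per-observation uncertainty) that a method such as DPMUnc is described as using.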


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/86d0/11398681/660113c722c2/pcbi.1012301.g001.jpg
