Liu Wenxuan, Murphy Thomas Brendan, Brennan Lorraine
UCD School of Agriculture and Food Science, Institute of Food and Health, University College Dublin, Belfield, Dublin, D04 V1W8, Ireland.
UCD School of Mathematics and Statistics, University College Dublin, Belfield, Dublin, D04 V1W8, Ireland.
Sci Rep. 2025 May 22;15(1):17817. doi: 10.1038/s41598-025-02361-9.
Metabolomics is the measurement of metabolites in biological samples to reveal information on metabolic pathways and phenotypes. Cluster analysis is a popular multivariate technique used in metabolomics to group observations with similar features. Previous work in the field has applied hard clustering approaches, which assign each observation to exactly one cluster. This can be overly restrictive in some practical applications, so there is a growing need for soft clustering methods that allow an observation to belong to more than one cluster. Simplex-structured matrix factorisation (SSMF) is proposed and applied in a simulation study and to a metabolomic dataset to demonstrate its utility for soft clustering. In the simulation study, the cluster prototypes and cluster memberships were well estimated. In the application to real metabolomic data, the gap statistic suggested the presence of four soft clusters. Furthermore, the Shannon diversity index indicated that several observations have memberships in three clusters. Introducing the covariates sex, age and BMI revealed that sex and age were the covariates mainly associated with the cluster memberships. The results indicate that a majority of men and young people were in the cluster predominantly characterised by high levels of amino acids and low levels of phosphatidylcholines and sphingomyelins. In contrast, a high proportion of older people were characterised by low levels of amino acids, biogenic amines, acylcarnitines and lysophosphatidylcholines. The SSMF approach presented successfully estimates a soft clustering of the metabolomic data and provides an interpretable representation of the data structure through the cluster prototypes combined with the cluster memberships. A software package called MetabolSSMF has been developed and is freely available as an R package to facilitate the implementation of soft clustering in the field of metabolomics.
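To make the idea of simplex-structured factorisation concrete, the following is a minimal Python sketch, not the MetabolSSMF implementation or its API. It assumes the data matrix X (observations by metabolites) is approximated as W H, where each row of W is a membership vector that is non-negative and sums to one, and each row of H is a cluster prototype. The fitting scheme (alternating least squares for H and projected gradient steps for W) is a generic choice made here for illustration, and the function names `project_rows_to_simplex`, `ssmf` and `shannon_diversity` are hypothetical. The Shannon index is computed in its standard entropy form, which may differ in detail from the form used in the paper.

```python
import numpy as np


def project_rows_to_simplex(V):
    """Euclidean projection of each row of V onto the probability simplex."""
    n, k = V.shape
    U = np.sort(V, axis=1)[:, ::-1]             # each row sorted in descending order
    css = np.cumsum(U, axis=1) - 1.0            # cumulative sums minus the target total
    idx = np.arange(1, k + 1)
    cond = U - css / idx > 0                    # always True in the first column
    rho = k - np.argmax(cond[:, ::-1], axis=1)  # number of active coordinates per row
    theta = css[np.arange(n), rho - 1] / rho    # per-row shift
    return np.maximum(V - theta[:, None], 0.0)


def ssmf(X, k, n_iter=200, n_inner=10, seed=0):
    """Illustrative fit of X ~ W @ H with the rows of W on the simplex.

    Assumption: generic alternating scheme, not the algorithm of the paper.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    W = rng.dirichlet(np.ones(k), size=n)       # random soft memberships
    for _ in range(n_iter):
        # Prototypes: unconstrained least squares given the memberships.
        H = np.linalg.lstsq(W, X, rcond=None)[0]
        # Memberships: projected gradient steps with step size 1 / L.
        L = np.linalg.norm(H @ H.T, 2) + 1e-12
        for _ in range(n_inner):
            grad = (W @ H - X) @ H.T
            W = project_rows_to_simplex(W - grad / L)
    H = np.linalg.lstsq(W, X, rcond=None)[0]    # final prototype update
    return W, H


def shannon_diversity(W, eps=1e-12):
    """Shannon entropy of each membership vector (0 means a single cluster)."""
    return -np.sum(W * np.log(np.clip(W, eps, 1.0)), axis=1)
```

Under these assumptions, `W, H = ssmf(X, k=4)` would return soft memberships and prototypes for four clusters, and `np.exp(shannon_diversity(W))` can be read as the effective number of clusters each observation belongs to. A value near 1 indicates an essentially hard assignment, while a value near 3 indicates memberships spread across three clusters.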