Department of Psychiatry, University of Groningen, University Medical Center Groningen, Interdisciplinary Center Psychopathology and Emotion regulation (ICPE), Groningen, The Netherlands.
Faculty of Philosophy, University of Groningen, Groningen, The Netherlands.
Psychol Med. 2022 Apr;52(6):1089-1100. doi: 10.1017/S0033291720002846. Epub 2020 Aug 11.
Cluster analyses have become popular tools for data-driven classification in biological psychiatric research. However, these analyses are known to be sensitive to the chosen methods and/or modelling options, which may hamper generalizability and replicability of findings. To gain more insight into this problem, we used Specification-Curve Analysis (SCA) to investigate the influence of methodological variation on biomarker-based cluster-analysis results.
Proteomics data (31 biomarkers) were used from patients (n = 688) and healthy controls (n = 426) in the Netherlands Study of Depression and Anxiety. In the SCAs, consistency of results was evaluated across 1200 k-means and hierarchical clustering analyses, each with a unique combination of clustering algorithm, fit index, and distance metric. Next, SCAs were run on simulated datasets with varying numbers of clusters and noise/outlier levels to evaluate the effect of data properties on SCA outcomes.
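The idea of running the same clustering question under many specifications can be illustrated with a small, self-contained sketch. It is not the authors' pipeline: it assumes only hierarchical clustering (three linkage rules), two distance metrics, and a single fit index (mean silhouette width), giving 6 specifications instead of 1200, and it runs on invented toy data. All function and variable names (`agglomerate`, `silhouette`, `specification_curve`, `toy`) are hypothetical.

```python
from collections import Counter
from itertools import combinations, product
import math

# Assumed specification grid: 2 distance metrics x 3 linkage rules.
METRICS = {
    "euclidean": math.dist,
    "manhattan": lambda p, q: sum(abs(a - b) for a, b in zip(p, q)),
}
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}

def agglomerate(points, k, metric, linkage):
    """Naive agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = linkage([metric(points[i], points[j])
                         for i in clusters[a] for j in clusters[b]])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # a < b, so index a stays valid
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def silhouette(points, labels, metric):
    """Mean silhouette width; singleton clusters score 0 by convention."""
    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(label, []).append(i)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in groups[label] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(metric(points[i], points[j]) for j in own) / len(own)
        b = min(sum(metric(points[i], points[j]) for j in members) / len(members)
                for other, members in groups.items() if other != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def specification_curve(points, k_range=range(2, 6)):
    """For each specification, return the cluster number with the best fit index."""
    chosen = {}
    for link_name, metric_name in product(LINKAGES, METRICS):
        metric, linkage = METRICS[metric_name], LINKAGES[link_name]
        fit = {k: silhouette(points, agglomerate(points, k, metric, linkage), metric)
               for k in k_range}
        chosen[(link_name, metric_name)] = max(fit, key=fit.get)
    return chosen

# Invented toy data: three tight, well-separated groups of four points each.
centers = [(0, 0), (10, 0), (0, 10)]
offsets = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5)]
toy = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

chosen = specification_curve(toy)
print(Counter(chosen.values()))  # how many specifications selected each k
```

With clean, well-separated toy data every specification is expected to agree on the same cluster number. In the sense used above, a lack of such agreement across specifications is what would indicate non-robust clustering.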
The real-data SCA showed no robust patterns of biological clustering in either the major depressive disorder (MDD) dataset or a combined MDD/healthy dataset. The simulation results showed that the correct number of clusters could be identified quite consistently across the 1200 model specifications, but that correct cluster identification became harder as the number of clusters and the noise level increased.
SCA can provide useful insights into the presence of clusters in biomarker data. However, SCA is likely to show inconsistent results in real-world biomarker datasets that are complex and contain considerable levels of noise. In such datasets, the number and nature of the observed clusters may depend strongly on the chosen model specification, precluding conclusions about the existence of biological clusters among psychiatric patients.