The Ohio State University College of Medicine, Columbus, Ohio, USA.
Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio, USA.
J Am Med Inform Assoc. 2020 Jul 1;27(7):1019-1027. doi: 10.1093/jamia/ocaa060.
Unsupervised machine learning approaches hold promise for large-scale clinical data. However, the heterogeneity of clinical data raises methodological challenges in feature selection, in choosing a distance metric that captures biological meaning, and in visualization. We hypothesized that clustering could discover prognostic groups among patients with chronic lymphocytic leukemia, a disease that provides biological validation through well-understood outcomes.
To address these challenges, we applied k-medoids clustering with 10 distance metrics to 2 experiments ("A" and "B") in which mixed clinical features were collapsed to binary vectors, and we visualized the results with both multidimensional scaling and t-distributed stochastic neighbor embedding (t-SNE). To assess prognostic utility, we performed survival analysis using a Cox proportional hazards model, the log-rank test, and Kaplan-Meier curves.
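As an illustration only, the clustering step can be sketched in plain Python. The abstract does not name the 10 distance metrics, so the Jaccard distance used here is an assumption, as are the toy patient vectors, the function names, and the simple alternating update with the first k points as starting medoids. This is a minimal sketch, not the authors' implementation.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 vectors (assumed metric)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def k_medoids(points, k, distance, max_iter=100):
    """Alternating k-medoids; returns (medoid indices, cluster label per point)."""
    n = len(points)
    # Precompute all pairwise distances once.
    dist = [[distance(points[i], points[j]) for j in range(n)] for i in range(n)]

    def assign(medoids):
        # Each point joins the cluster of its nearest medoid (ties go to the first).
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    medoids = list(range(k))  # simplification: start from the first k points
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Empty cluster: keep the previous medoid.
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            best = min(members, key=lambda m: sum(dist[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Hypothetical binary feature vectors for six patients (not data from the study).
patients = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
medoids, labels = k_medoids(patients, k=2, distance=jaccard_distance)
```

Because the medoid is always an observed patient and only pairwise distances are needed, any of the candidate binary distance metrics can be swapped in by passing a different `distance` function.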
In both experiments, survival analysis revealed a statistically significant association between clusters and survival outcomes (A: overall survival, P = .0164; B: time from diagnosis to treatment, P = .0039). Multidimensional scaling separated clusters along a gradient mirroring the order of overall survival. Longer survival was associated with mutated immunoglobulin heavy-chain variable region gene (IGHV) status, absent ZAP70 expression, female sex, and younger age.
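For readers unfamiliar with the survival curves compared across clusters, the Kaplan-Meier product-limit estimate can be sketched as follows. The function name and the follow-up data are hypothetical and chosen only for illustration. In practice the estimate would be computed separately for the patients in each cluster.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate.

    times  : follow-up duration for each patient
    events : 1 if the event (e.g. death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            # Multiply by the conditional probability of surviving past time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after time t.
        at_risk -= removed
    return curve


# Hypothetical follow-up times (years) and event indicators for one cluster.
curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
```

Censored patients contribute to the risk set up to their last follow-up but never trigger a drop in the curve, which is why the estimate only changes at observed event times.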
This approach to mixed-type data handling and selection of distance metric captured well-understood, binary, prognostic markers in chronic lymphocytic leukemia (sex, IGHV mutation status, ZAP70 expression status) with high fidelity.