Ferro Sara, Bottigliengo Daniele, Gregori Dario, Fabricio Aline S C, Gion Massimo, Baldi Ileana
Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac Thoracic Vascular Sciences and Public Health, University of Padova, Via Loredan 18, 35121 Padova, Italy.
Veneto Institute of Oncology IOV-IRCCS, 35128 Padua, Italy.
J Pers Med. 2021 Apr 5;11(4):272. doi: 10.3390/jpm11040272.
Primary breast cancer (PBC) is a heterogeneous disease at the clinical, histopathological, and molecular levels. Improved classification of PBC may help identify disease subgroups that are relevant to patient management. Machine learning algorithms may allow a better understanding of the relationships within heterogeneous clinical syndromes. This work aims to show the potential of unsupervised learning techniques for improving classification in PBC. A dataset of 712 women with PBC is used as a motivating example. A set of variables containing biological prognostic parameters is used to define groups of individuals. Four clustering methods are compared: K-means, self-organising maps, hierarchical agglomerative clustering (HAC), and Gaussian mixture models. HAC outperforms the other clustering methods. With an optimal partitioning parameter, the methods identify two clusters with different clinical profiles. Patients in the first cluster are younger and have lower values of the oestrogen receptor (ER) and progesterone receptor (PgR) than patients in the second cluster. Cathepsin D values are also lower in the first cluster. The three most important variables identified by HAC are age, ER, and PgR. Unsupervised learning seems a suitable alternative for the analysis of PBC data, opening up new perspectives in the particularly active domain of dissecting clinical heterogeneity.
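The comparison of clustering methods described in the abstract can be illustrated with a minimal sketch. This is not the authors' analysis. It assumes scikit-learn, uses synthetic stand-in data rather than the 712-patient dataset, and scores partitions with the silhouette index, which is an assumption because the abstract does not name the validation criterion. Self-organising maps are omitted because scikit-learn does not provide them. The function name `compare_clusterings` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def compare_clusterings(X, k=2, seed=0):
    """Fit three clustering methods with k clusters and score each partition.

    Returns (labels, scores): dicts keyed by method name. The silhouette
    index is an assumed criterion, not one stated in the abstract.
    """
    # Standardise so that no single variable dominates the distances
    Xs = StandardScaler().fit_transform(X)
    labels = {
        "kmeans": KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Xs),
        "hac": AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(Xs),
        "gmm": GaussianMixture(n_components=k, random_state=seed).fit_predict(Xs),
    }
    scores = {name: silhouette_score(Xs, lab) for name, lab in labels.items()}
    return labels, scores


# Synthetic stand-in data: two well-separated groups with three features each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 3)),
    rng.normal(loc=6.0, scale=1.0, size=(100, 3)),
])

labels, scores = compare_clusterings(X, k=2)
for name, score in scores.items():
    print(f"{name}: silhouette = {score:.3f}")
```

On real data, the number of clusters `k` would itself be chosen by repeating this comparison over a range of values, and the ranking of methods could differ from the one reported in the abstract.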