Mantero Alejandro, Ishwaran Hemant
Division of Biostatistics, University of Miami, Miami, Florida, USA.
Stat Anal Data Min. 2021 Apr;14(2):144-167. doi: 10.1002/sam.11498. Epub 2021 Feb 5.
sidClustering is a new unsupervised machine learning algorithm based on random forests. The first step in sidClustering is what is called sidification of the features: the features are staggered so that they have mutually exclusive ranges (giving the staggered interaction data [SID] main features), and then all pairwise interactions are formed (giving the SID interaction features). A multivariate random forest, which can handle both continuous and categorical variables, is then used to predict the SID main features. We establish uniqueness of sidification and show how multivariate impurity splitting is able to identify clusters. The proposed sidClustering method is adept at finding clusters arising from categorical and continuous variables and retains all the important advantages of random forests. The method is illustrated using simulated and real data as well as two in-depth case studies: one from a large multi-institutional study of esophageal cancer, and the other involving hospital charges for cardiovascular patients.
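The sidification step described above can be illustrated with a minimal sketch for continuous features only. This is not the authors' implementation: the function name `sidify`, the gap parameter `delta`, the left-to-right staggering order, and the use of products as pairwise interactions are all assumptions made for illustration, and categorical features are not handled.

```python
import numpy as np
from itertools import combinations


def sidify(X, delta=1.0):
    """Hypothetical sketch of sidification for continuous features.

    Returns (main, inter, pairs):
      main  - features shifted so their ranges are mutually exclusive
      inter - one column per feature pair (assumed here to be a product)
      pairs - the (a, b) column indices behind each interaction column
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    main = np.empty_like(X)
    offset = 0.0  # upper end of the range used so far
    for j in range(d):
        col = X[:, j] - X[:, j].min()      # feature now starts at 0
        main[:, j] = col + offset + delta  # place it above the previous range
        offset = main[:, j].max()
    pairs = list(combinations(range(d), 2))
    if pairs:
        inter = np.column_stack([main[:, a] * main[:, b] for a, b in pairs])
    else:
        inter = np.empty((n, 0))           # fewer than two features
    return main, inter, pairs


# Small usage example with two features and three observations.
X = np.array([[0.0, 5.0],
              [1.0, 3.0],
              [2.0, 4.0]])
main, inter, pairs = sidify(X, delta=1.0)
```

In this sketch each staggered feature occupies its own interval, separated from the previous one by `delta`, so a value in `main` identifies which feature it came from. The multivariate random forest step is omitted here; it would be fitted with the SID main features as the multivariate outcome.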