Katahira Kentaro, Takano Keisuke, Oba Takeyuki, Kimura Kenta
Human Informatics and Interaction Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, 305-8566, Japan.
BMC Psychol. 2024 Dec 18;12(1):733. doi: 10.1186/s40359-024-02268-6.
Profiling or clustering individuals based on personality and other characteristics is a common statistical approach in marketing, medicine, and the social sciences. This approach simplifies data, supports data-driven decision-making, and guides intervention strategies such as personalized care. However, clustering loses information because it discretizes continuous variables. Although some loss of information may be acceptable in practice, the amount of information lost and its influence on predicting external outcomes have not yet been systematically investigated.
We assessed the accuracy of predicting physical activity using the clustering approach and compared it with the dimensional approach, where variables are used as continuous regressors. This analysis is based on survey data from a sample of 20,573 individuals regarding physical activity and psychological traits, including the Big-Five personality traits.
A four-cluster solution, supported by the standard criterion for determining the number of clusters, achieved no more than 60-70% of the prediction accuracy of the dimensional approach, which uses the raw dimensional scales as explanatory variables.
The cluster solution suggested by conventional statistical criteria may not be optimal when clusters are used to predict external outcomes.
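The comparison between the clustering and dimensional approaches can be illustrated with a minimal sketch. It does not reproduce the study's analysis: the data are synthetic, the clustering method (a plain k-means written in NumPy), the in-sample R² used as the accuracy measure, and all names (`traits`, `activity`, `kmeans_labels`, `r_squared`) are assumptions made for illustration only.

```python
import numpy as np


def r_squared(X, y):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid.var() / y.var()


def kmeans_labels(X, k, n_iter=50, seed=0):
    """Plain k-means (assumed here; the abstract does not name the method)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


# Synthetic stand-in for five continuous trait scores and one outcome.
rng = np.random.default_rng(42)
n, p, k = 5000, 5, 4
traits = rng.standard_normal((n, p))
weights = np.array([0.5, 0.3, -0.2, 0.1, 0.0])  # hypothetical effect sizes
activity = traits @ weights + rng.standard_normal(n)

# Dimensional approach: regress the outcome on the continuous scores.
r2_dim = r_squared(traits, activity)

# Clustering approach: regress the outcome on cluster membership only.
labels = kmeans_labels(traits, k)
dummies = np.eye(k)[labels][:, 1:]  # drop one column (reference cluster)
r2_clu = r_squared(dummies, activity)

ratio = r2_clu / r2_dim
print(f"R^2 dimensional: {r2_dim:.3f}")
print(f"R^2 clustering:  {r2_clu:.3f}")
print(f"relative accuracy (cluster / dimensional): {ratio:.2f}")
```

The ratio printed at the end plays the role of the relative prediction accuracy reported above, although its value in this toy setting depends entirely on the simulated data and should not be compared with the study's 60-70% figure.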