Miller Justin K, Alexander Tristram J
School of Physics, The University of Sydney, Sydney, Australia.
R Soc Open Sci. 2025 Jan 22;12(1):241692. doi: 10.1098/rsos.241692. eCollection 2025 Jan.
Clustering short text is a difficult problem, owing to the low word co-occurrence between short text documents. This work shows that large language models (LLMs) can overcome the limitations of traditional clustering approaches by generating embeddings that capture the semantic nuances of short text. In this study, clusters are found in the embedding space using Gaussian mixture modelling. The resulting clusters are more distinctive and more human-interpretable than those produced by the popular doc2vec and latent Dirichlet allocation methods. The success of the clustering approach is quantified both by human reviewers and by a generative LLM. The generative LLM shows good agreement with the human reviewers and is suggested as a means to bridge the 'validation gap' that often exists between cluster production and cluster interpretation. The comparison between LLM coding and human coding reveals intrinsic biases in each, challenging the conventional reliance on human coding as the definitive standard for cluster validation.
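The embed-then-cluster pipeline described in the abstract can be illustrated with a minimal sketch: embed each short text with an LLM embedding model, then fit a Gaussian mixture in the embedding space and assign each document to its most probable component. The embedding model name, the sample documents, and the number of mixture components below are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the embed-then-cluster pipeline (illustrative only).
# Assumptions not from the paper: the sentence-transformers model
# "all-MiniLM-L6-v2" stands in for whichever LLM embedding model the
# authors used, and n_components is chosen arbitrarily for this toy data.
from sentence_transformers import SentenceTransformer
from sklearn.mixture import GaussianMixture

docs = [
    "battery drains too fast",
    "screen cracked after one drop",
    "love the camera quality",
    "photos look stunning in low light",
]

# Embed each short text into a dense semantic vector.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs)

# Fit a Gaussian mixture in the embedding space and assign each
# document to its most probable mixture component (its cluster).
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(embeddings)
print(labels)  # e.g. [0, 0, 1, 1] -- exact labels depend on the model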