Personality testing of large language models: limited temporal stability, but highlighted prosociality.

Authors

Bodroža Bojana, Dinić Bojana M, Bojić Ljubiša

Affiliations

Department of Psychology, Faculty of Philosophy, University of Novi Sad, Novi Sad, Serbia.

Institute for Artificial Intelligence Research and Development of Serbia, Novi Sad, Serbia.

Publication information

R Soc Open Sci. 2024 Oct 9;11(10):240180. doi: 10.1098/rsos.240180. eCollection 2024 Oct.

DOI: 10.1098/rsos.240180

PMID: 39386990

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11461045/

Abstract

As large language models (LLMs) continue to gain popularity due to their human-like traits and the intimacy they offer to users, their societal impact inevitably expands. This leads to the rising necessity for comprehensive studies to fully understand LLMs and reveal their potential opportunities, drawbacks and overall societal impact. With that in mind, this research conducted an extensive investigation into seven LLMs, aiming to assess the temporal stability and inter-rater agreement on their responses on personality instruments in two time points. In addition, LLMs' personality profile was analysed and compared with human normative data. The findings revealed varying levels of inter-rater agreement in the LLMs' responses over a short time, with some LLMs showing higher agreement (e.g. Llama3 and GPT-4o) compared with others (e.g. GPT-4 and Gemini). Furthermore, agreement depended on used instruments as well as on domain or trait. This implies the variable robustness in LLMs' ability to reliably simulate stable personality characteristics. In the case of scales which showed at least fair agreement, LLMs displayed mostly a socially desirable profile in both agentic and communal domains, as well as a prosocial personality profile reflected in higher agreeableness and conscientiousness and lower Machiavellianism. Exhibiting temporal stability and coherent responses on personality traits is crucial for AI systems due to their societal impact and AI safety concerns.
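The abstract does not say which agreement statistic was used, so the following is only an illustrative sketch of how agreement between one model's responses at two time points could be quantified. It uses unweighted Cohen's kappa as a stand-in; the function name `cohens_kappa` and the response vectors are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Unweighted Cohen's kappa for two equally long sequences of ratings."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first)
    # Proportion of items answered identically at both time points.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance, from the marginal category frequencies.
    counts_first = Counter(first)
    counts_second = Counter(second)
    expected = sum(counts_first[c] * counts_second[c] for c in counts_first) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences consist of one identical category
    return (observed - expected) / (1 - expected)


# Hypothetical 5-point Likert answers from one model at two time points.
time_1 = [5, 4, 4, 2, 1, 5, 3, 4]
time_2 = [5, 4, 3, 2, 1, 5, 3, 5]
kappa = cohens_kappa(time_1, time_2)
print(f"kappa = {kappa:.3f}")
```

Unweighted kappa treats a 4-versus-5 disagreement the same as a 1-versus-5 disagreement; for ordinal Likert items a weighted kappa or an intraclass correlation may be a better fit.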
