MacBehaviour: An R package for behavioural experimentation on large language models.

Authors

Duan Xufeng, Li Shixuan, Cai Zhenguang G

Affiliations

Department of Linguistics and Modern Languages, The Chinese University of Hong Kong, Hong Kong, China.

Brain and Mind Institute, The Chinese University of Hong Kong, Hong Kong, China.

Publication information

Behav Res Methods. 2024 Dec 18;57(1):19. doi: 10.3758/s13428-024-02524-y.

DOI: 10.3758/s13428-024-02524-y
PMID: 39694977
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11655609/
Abstract

The study of large language models (LLMs) and LLM-powered chatbots has gained significant attention in recent years, with researchers treating LLMs as participants in psychological experiments. To facilitate this research, we developed an R package called "MacBehaviour" (https://github.com/xufengduan/MacBehaviour), which interacts with over 100 LLMs, including OpenAI's GPT family, the Claude family, Gemini, the Llama family, and other open-weight models. The package streamlines LLM behavioural experimentation by providing a comprehensive set of functions for experiment design, stimuli presentation, model behaviour manipulation, and logging of responses and token probabilities. With a few lines of code, researchers can set up and conduct psychological experiments, making LLM behaviour studies highly accessible. To validate the utility and effectiveness of "MacBehaviour," we conducted three experiments on GPT-3.5 Turbo, Llama-2-7b-chat-hf, and Vicuna-1.5-13b, replicating the sound-gender association in LLMs. The results consistently demonstrated that these LLMs exhibit human-like tendencies to infer gender from novel personal names based on their phonology, as previously shown by Cai et al. (2024). In conclusion, "MacBehaviour" is a user-friendly R package that simplifies and standardises the experimental process for machine behaviour studies, offering a valuable tool for researchers in this field.
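
The workflow the abstract describes (design the experiment, present stimuli, log responses and token probabilities) can be illustrated with a minimal, language-agnostic sketch. The example below is written in Python and does not use the MacBehaviour API; every name in it (`STIMULI`, `build_trials`, `stub_model`, `run_experiment`, `to_csv`) is hypothetical, the made-up names are not the paper's stimuli, and `stub_model` is a toy rule standing in for a real LLM call, so its output says nothing about how any actual model behaves.

```python
import csv
import io
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stimuli: (condition, made-up name). Not taken from the paper.
STIMULI: List[Tuple[str, str]] = [
    ("vowel-final", "Pelcra"),
    ("consonant-final", "Pelcrad"),
    ("vowel-final", "Bimli"),
    ("consonant-final", "Bimlik"),
]

PROMPT = 'Is "{name}" more likely a male or a female name? Answer in one word.'


@dataclass(frozen=True)
class Trial:
    run: int        # repetition index, analogous to a participant or session
    item: int       # index of the stimulus in STIMULI (1-based)
    condition: str  # experimental condition of the stimulus
    prompt: str     # text that would be sent to the model


def build_trials(stimuli: List[Tuple[str, str]], n_runs: int, seed: int = 0) -> List[Trial]:
    """Cross every stimulus with every run, shuffling item order within each run."""
    rng = random.Random(seed)
    trials: List[Trial] = []
    for run in range(1, n_runs + 1):
        order = list(enumerate(stimuli, start=1))
        rng.shuffle(order)
        for item, (condition, name) in order:
            trials.append(Trial(run, item, condition, PROMPT.format(name=name)))
    return trials


def stub_model(prompt: str) -> Tuple[str, Dict[str, float]]:
    """Toy stand-in for an LLM call; the rule and the probabilities are invented."""
    name = prompt.split('"')[1]
    p_female = 0.8 if name[-1].lower() in "aeiou" else 0.2
    probs = {"female": p_female, "male": 1.0 - p_female}
    return max(probs, key=probs.get), probs


def run_experiment(
    trials: List[Trial],
    model: Callable[[str], Tuple[str, Dict[str, float]]],
) -> List[dict]:
    """Present each prompt to the model and log the response with its probability."""
    rows = []
    for t in trials:
        response, probs = model(t.prompt)
        rows.append({
            "run": t.run,
            "item": t.item,
            "condition": t.condition,
            "prompt": t.prompt,
            "response": response,
            "p_response": probs[response],
        })
    return rows


def to_csv(rows: List[dict]) -> str:
    """Serialise the trial log as CSV text, one row per trial."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


trials = build_trials(STIMULI, n_runs=2)
rows = run_experiment(trials, stub_model)
print(to_csv(rows))
```

Replacing `stub_model` with a function that queries a real model would leave the rest of the loop unchanged, which is the kind of separation between design, presentation, and logging that the abstract attributes to the package.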

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b1e/11655609/8fb5e74d1cf5/13428_2024_2524_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b1e/11655609/b6a7da0c4fb4/13428_2024_2524_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b1e/11655609/19bdd05e5639/13428_2024_2524_Fig3_HTML.jpg
