
Strong and weak alignment of large language models with human values.

Affiliations

Institute of Intelligent Systems and Robotics, Sorbonne University/CNRS, 75005, Paris, France.

Publication Information

Sci Rep. 2024 Aug 21;14(1):19399. doi: 10.1038/s41598-024-70031-3.

DOI: 10.1038/s41598-024-70031-3
PMID: 39169090
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11339283/
Abstract

Minimizing negative impacts of Artificial Intelligence (AI) systems on human societies without human supervision requires them to be able to align with human values. However, most current work only addresses this issue from a technical point of view, e.g., improving current methods relying on reinforcement learning from human feedback, neglecting what it means and is required for alignment to occur. Here, we propose to distinguish strong and weak value alignment. Strong alignment requires cognitive abilities (either human-like or different from humans) such as understanding and reasoning about agents' intentions and their ability to causally produce desired effects. We argue that this is required for AI systems like large language models (LLMs) to be able to recognize situations presenting a risk that human values may be flouted. To illustrate this distinction, we present a series of prompts showing ChatGPT's, Gemini's and Copilot's failures to recognize some of these situations. We moreover analyze word embeddings to show that the nearest neighbors of some human values in LLMs differ from humans' semantic representations. We then propose a new thought experiment that we call "the Chinese room with a word transition dictionary", extending John Searle's famous proposal. We finally mention current promising research directions towards a weak alignment, which could produce statistically satisfying answers in a number of common situations, however so far without ensuring any truth value.
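To make the embedding analysis concrete, here is a minimal sketch of the kind of nearest-neighbor computation the abstract describes: given a table of word vectors, rank the other vocabulary items by cosine similarity to a value term. Everything in this snippet is an assumption for illustration — the vocabulary is invented and the vectors are random placeholders; a real analysis would load embeddings extracted from an LLM or a pretrained embedding model.

```python
# Sketch: nearest neighbors of a value term in an embedding space.
# Hypothetical vocabulary and random placeholder vectors, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["honesty", "truth", "accuracy", "policy", "compliance", "safety"]
# One hypothetical 8-dimensional vector per word.
embeddings = {word: rng.normal(size=8) for word in vocab}

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_neighbors(query: str, k: int = 3) -> list[tuple[str, float]]:
    """Rank all other vocabulary words by similarity to the query word."""
    q = embeddings[query]
    scores = [(w, cosine(q, embeddings[w])) for w in vocab if w != query]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:k]

print(nearest_neighbors("honesty"))
```

Comparing such neighbor lists against human word-association data is one way to test whether a model's representation of a value diverges from humans' semantic representations, which is the comparison the abstract reports.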

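The "Chinese room with a word transition dictionary" can be pictured as a system that emits fluent-looking text purely by looking up recorded word-to-word transitions. The toy below is a hypothetical illustration of that idea, not the paper's actual construction; the corpus is invented.

```python
# Toy "word transition dictionary": text produced by following recorded
# word-to-word transitions, with no representation of meaning involved.
# The corpus is an invented stand-in for illustration.
import random
from collections import defaultdict

corpus = "the model aligns with human values the model follows human feedback".split()
transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

def generate(start: str, length: int = 8, seed: int = 1) -> str:
    """Walk the transition table, choosing a recorded successor at each step."""
    random.seed(seed)
    words = [start]
    for _ in range(length - 1):
        successors = transitions.get(words[-1])
        if not successors:
            break
        words.append(random.choice(successors))
    return " ".join(words)

print(generate("the"))
```

Like Searle's original room, the generator produces plausible sequences without any grasp of what the words mean, which is roughly the intuition the thought experiment extends to LLMs.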

Figures
Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/119e/11339283/5419e7aa5945/41598_2024_70031_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/119e/11339283/da3a2012703f/41598_2024_70031_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/119e/11339283/a6e748749164/41598_2024_70031_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/119e/11339283/d09b2a68fe44/41598_2024_70031_Fig4_HTML.jpg
Fig. 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/119e/11339283/5b10ddcef0e8/41598_2024_70031_Fig5_HTML.jpg
Fig. 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/119e/11339283/cd460537b18b/41598_2024_70031_Fig6_HTML.jpg

Similar Articles

1. Strong and weak alignment of large language models with human values. Sci Rep. 2024 Aug 21;14(1):19399. doi: 10.1038/s41598-024-70031-3.
2. Deception abilities emerged in large language models. Proc Natl Acad Sci U S A. 2024 Jun 11;121(24):e2317967121. doi: 10.1073/pnas.2317967121. Epub 2024 Jun 4.
3. Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values. JMIR Ment Health. 2024 Apr 9;11:e55988. doi: 10.2196/55988.
4. Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora's Box Has Been Opened. J Med Internet Res. 2023 May 31;25:e46924. doi: 10.2196/46924.
5. The Role of Humanization and Robustness of Large Language Models in Conversational Artificial Intelligence for Individuals With Depression: A Critical Analysis. JMIR Ment Health. 2024 Jul 2;11:e56569. doi: 10.2196/56569.
6. Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study. JMIR Med Educ. 2024 Feb 9;10:e48514. doi: 10.2196/48514.
7. Utility of artificial intelligence-based large language models in ophthalmic care. Ophthalmic Physiol Opt. 2024 May;44(3):641-671. doi: 10.1111/opo.13284. Epub 2024 Feb 25.
8. Large Language Models and User Trust: Consequence of Self-Referential Learning Loop and the Deskilling of Health Care Professionals. J Med Internet Res. 2024 Apr 25;26:e56764. doi: 10.2196/56764.
9. Symbol ungrounding: what the successes (and failures) of large language models reveal about human cognition. Philos Trans R Soc Lond B Biol Sci. 2024 Oct 7;379(1911):20230149. doi: 10.1098/rstb.2023.0149. Epub 2024 Aug 19.
10. Artificial intelligence in clinical pharmacology: A case study and scoping review of large language models and bioweapon potential. Br J Clin Pharmacol. 2024 Mar;90(3):620-628. doi: 10.1111/bcp.15899. Epub 2023 Sep 24.

Cited By

1. Gen AI and research integrity: Where to now? The integration of Generative AI in the research process challenges well-established definitions of research integrity. EMBO Rep. 2025 Apr;26(8):1923-1928. doi: 10.1038/s44319-025-00424-6. Epub 2025 Mar 24.
2. Navigating artificial general intelligence development: societal, technological, ethical, and brain-inspired pathways. Sci Rep. 2025 Mar 11;15(1):8443. doi: 10.1038/s41598-025-92190-7.

References Cited in This Article

1. Use of large language models might affect our cognitive skills. Nat Hum Behav. 2024 May;8(5):805-806. doi: 10.1038/s41562-024-01859-y.
2. Infants Infer Social Relationships Between Individuals Who Engage in Imitative Social Interactions. Open Mind (Camb). 2024 Mar 5;8:202-216. doi: 10.1162/opmi_a_00124. eCollection 2024.
3. Generating meaning: active inference and the scope and limits of passive AI. Trends Cogn Sci. 2024 Feb;28(2):97-112. doi: 10.1016/j.tics.2023.10.002. Epub 2023 Nov 15.
4. Do We Collaborate With What We Design? Top Cogn Sci. 2023 Aug 15. doi: 10.1111/tops.12682.
5. Using cognitive psychology to understand GPT-3. Proc Natl Acad Sci U S A. 2023 Feb 7;120(6):e2218523120. doi: 10.1073/pnas.2218523120. Epub 2023 Feb 2.
6. A Conversation on Artificial Intelligence, Chatbots, and Plagiarism in Higher Education. Cell Mol Bioeng. 2023 Jan 2;16(1):1-2. doi: 10.1007/s12195-022-00754-8. eCollection 2023 Feb.
7. Word meaning in minds and machines. Psychol Rev. 2023 Mar;130(2):401-431. doi: 10.1037/rev0000297. Epub 2021 Jul 22.
8. Human- versus Artificial Intelligence. Front Artif Intell. 2021 Mar 25;4:622364. doi: 10.3389/frai.2021.622364. eCollection 2021.
9. Toward Self-Aware Robots. Front Robot AI. 2018 Aug 13;5:88. doi: 10.3389/frobt.2018.00088. eCollection 2018.
10. Beyond dichotomies in reinforcement learning. Nat Rev Neurosci. 2020 Oct;21(10):576-586. doi: 10.1038/s41583-020-0355-6. Epub 2020 Sep 1.