
Background: Artificial intelligence (AI), particularly large language models (LLMs), has demonstrated versatility in various applications but faces challenges in specialized domains like neurology. This study evaluates a specialized LLM's capability and trustworthiness in complex neurological diagnosis, comparing its performance to neurologists in simulated clinical settings.

Methods: We deployed GPT-4 Turbo (OpenAI, San Francisco, CA, US) through Neura (Sciense, New York, NY, US), an AI infrastructure with a dual-database architecture integrating "long-term memory" and "short-term memory" components on a curated neurological corpus. Five representative clinical scenarios were presented to 13 neurologists and the AI system. Participants formulated differential diagnoses based on initial presentations, followed by definitive diagnoses after receiving conclusive clinical information. Two senior academic neurologists blindly evaluated all responses, while an independent investigator assessed the verifiability of AI-generated information.

Results: AI achieved a significantly higher normalized score (86.17%) compared to neurologists (55.11%, p < 0.001). For differential diagnosis questions, AI scored 85% versus 46.15% for neurologists, and for final diagnosis, 88.24% versus 70.93%. AI obtained 15 maximum scores in its 20 evaluations and responded in under 30 s compared to neurologists' average of 9 min. All AI-provided references were classified as relevant with no hallucinatory content detected.

Conclusion: A specialized LLM demonstrated superior diagnostic performance compared to practicing neurologists across complex clinical challenges. This indicates that appropriately harnessed LLMs with curated knowledge bases can achieve domain-specific relevance in complex clinical disciplines, suggesting potential for AI as a time-efficient asset in clinical practice.
