
A Review of Recent Advances in Training Surgical Video Large Language Models and Their Prospects for Clinical Application

Deep Research · Posted by an anonymous user on 2026-05-03 12:03 · 39 views

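1. Research Background and Value of Surgical Video Large Language Models

1.1 Clinical Demand for Intelligent Upgrading of the Surgical Setting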

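Clinical surgery is procedurally complex and demands very high precision, and the traditional model of surgical management faces difficulties in three areas: quality control of operative standards, skills training, and intraoperative assistance. First, in quality control, every detail of an operation can affect patient outcome, yet manual review is inefficient and subject to reviewer bias. In endoscopic procedures such as colonoscopy, for example, automatically segmenting the video into anatomical sections and procedural phases is essential for automated reporting and quality control, but it remains an active and difficult research problem [1].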

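Second, in skills training, surgeons take many years to train, and the traditional master-apprentice model is limited by the number of available cases, by teaching resources, and by the safety of letting trainees practice. Work-hour restrictions and pressure for operating efficiency make high-quality orthopaedic surgical training harder to deliver [2]. Simulation has become an important training adjunct in nearly all surgical specialties [2, 3]; in orthopaedics, a range of platforms, including virtual reality simulators, help trainees arrive better prepared for the operating room [2]. Temporal bone dissection, for example, is a difficult skill to acquire because of the complex anatomy, and the loss of conventional training opportunities during the pandemic underlined the importance of virtual temporal bone simulators [4]. In complex procedures such as neonatal minimally invasive surgery, cases are rare and residents get limited clinical exposure, so high-quality, low-cost simulators are essential for training [5]. Simulation offers a safe practice environment and can reduce the slope of the learning curve, but it cannot fully replace the real operation and must therefore be integrated into a comprehensive training curriculum [3]. Using surgical video to assess training quantitatively and to give individualized feedback is a pressing need in skills training today.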

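Finally, in intraoperative decision support, surgeons make complex, high-stakes decisions under pressure and uncertainty, and those decisions significantly affect patient outcomes [6]. Traditional clinical decision-support systems suffer from time-consuming data management and limited accuracy; artificial intelligence (AI) may strengthen surgical decision-making by analyzing electronic health record data and mobile-device output in real time [6]. In digestive endoscopy, for example, AI has shown considerable value by improving diagnostic accuracy, reducing physician workload, and providing a basis for diagnosis and treatment [7]. Deeper intraoperative assistance, such as recognizing surgical steps in real time, warning of complications, and suggesting the next action, requires real-time and accurate semantic understanding of the surgical video. Risk management offers a parallel: for post-discharge nausea and vomiting (PDNV), clinicians assess the patient's risk before surgery and provide interventions afterwards [8]. Assessing the patient's state during the operation and flagging risk in real time would improve patient safety considerably.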

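In short, semantic analysis of surgical video is the key to making the surgical workflow intelligent. A deep understanding of video content addresses the pain points above and moves clinical practice toward safer, more efficient, and more individualized care.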

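1.2 Technical Rationale for Applying Large Language Models to Surgical Video

Large language models (LLMs), with their multimodal understanding and generation capabilities, offer a new technical paradigm for processing and analyzing surgical video. Traditional computer vision models for surgical video target specific visual tasks such as instrument segmentation [9], organ recognition [10], or workflow recognition [11]. These models generally struggle with high-level semantic reasoning and cross-modal integration in complex surgical scenes, and they cannot turn visual information directly into clinical language that surgeons can read and use.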

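The core strengths of LLMs are natural language processing and multimodal semantic understanding. First, an LLM can link visual features extracted from surgical video to a large body of medical text, moving from pixels to semantics. Combined with vision-language models, frame-level visual information can be converted into structured text descriptions and then passed to the LLM [12, 13], which lets the LLM interpret the video and reason over it at a higher level. Multimodal large language models (MLLMs), for example, accept interleaved multimodal input such as images and text and generate text or even image output [14].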

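Second, an LLM can use its reasoning ability to turn the visual content of a surgical video, including operative actions, changes in anatomy, and signs of complications, into clinically coherent narrative text or structured reports. This conversion of visual information into clinically interpretable text is central to intelligent surgery. PathChat, a general-purpose vision-language AI assistant for human pathology, handles both visual and natural-language input and has shown high accuracy on diagnostic questions and pathologist-preferred responses to open-ended queries, which demonstrates the potential of multimodal LLMs in medicine [15]. Systems such as NExT-GPT go further and support any-to-any conversion across arbitrary combinations of text, image, video, and audio, which lays groundwork for comprehensive understanding of surgical video and multi-format output [16].

Long-context processing also matters. Gemini 1.5, for example, can process contexts of millions of tokens, including long video and audio, recall and reason over fine-grained information within them, and achieve near-perfect long-context retrieval [17]. An LLM can therefore analyze the continuous video stream of an entire operation rather than isolated clips and so better follow overall progress and context, which is essential for accurate workflow recognition and for early warning of potential risks [18].

In sum, by combining multimodal data processing, deep semantic understanding, knowledge-based reasoning, and long-context analysis, LLMs can turn the complex visual information in surgical video into clinical insight and decision support that surgeons can understand and apply. They thereby supply core technology for intelligent surgery and push computer-assisted surgery systems toward greater intelligence and autonomy.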

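2. Core Technical Framework for Training Surgical Video Large Language Models

2.1 Multimodal Training Data Preprocessing

Training a surgical video large language model (SV-LLM) successfully depends on high-quality multimodal training data. The data must be carefully preprocessed so that the model can learn visual features, operative actions, and clinical semantics. A standardized pipeline has four main stages: video frame sampling, extraction of operative action features, annotation of surgical semantic nodes, and alignment with unstructured clinical text.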

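2.1.1 Surgical Video Frame Sampling and Quality Control
Surgical videos are usually long and highly redundant. Effective frame sampling improves training efficiency and model performance. Common strategies include:

  • Fixed-interval sampling: sample uniformly at a preset time or frame interval, which preserves temporal continuity.
  • Keyframe extraction: use image processing to identify keyframes, such as marked scene changes, instrument changes, or the start and end of important maneuvers. This reduces redundancy while keeping the key information.
  • Blurred and invalid frame detection: surgical video often contains blurred or invalid frames caused by instrument occlusion, lens contamination, or rapid motion, and these low-quality frames interfere with learning. One reported method uses a convolutional LSTM (ConvLSTM) network to estimate a blur score; it removed 88.3% of blurred frames and raised classification accuracy to 95.2% [19]. Removing invalid frames remains a challenge in automated video editing because their visual features are indistinct and easily misclassified [19].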
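
The sampling and filtering strategies above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of [19]: the blur score is the variance of a 4-neighbour Laplacian, a classical sharpness measure swapped in here for the ConvLSTM detector, and the function names, the frame representation (a 2D list of grayscale intensities), and the threshold are all hypothetical.

```python
def fixed_interval_indices(n_frames, fps, interval_s):
    """Indices of frames sampled every `interval_s` seconds."""
    step = max(1, round(fps * interval_s))
    return list(range(0, n_frames, step))


def laplacian_variance(gray):
    """Sharpness score: variance of the 4-neighbour Laplacian over interior pixels."""
    h, w = len(gray), len(gray[0])
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1]
                   + gray[y][x + 1] - 4 * gray[y][x])
            vals.append(lap)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def keep_sharp(frames, threshold):
    """Indices of frames whose sharpness score reaches the (assumed) threshold."""
    return [i for i, f in enumerate(frames) if laplacian_variance(f) >= threshold]


# A featureless frame scores 0; a high-contrast checkerboard scores high.
flat = [[5] * 4 for _ in range(4)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(4)] for y in range(4)]
sampled = fixed_interval_indices(n_frames=10, fps=2, interval_s=1.0)
sharp = keep_sharp([flat, checker], threshold=100.0)
```

In practice the threshold would have to be tuned per camera and procedure, since smoke, specular highlights, and lens fogging all shift the score distribution.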

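2.1.2 Operative Action Feature Extraction
Operative actions are a core element of understanding an operation. Turning this visual information into features a model can use involves several kinds of extraction:

  • Instrument trajectory and pose: track the tip trajectory and three-dimensional pose of surgical instruments. Optoelectronic motion analysis systems can capture such kinematic data, which is used to assess surgeon performance and to build reproducible performance measures [20].
  • Gesture recognition: identify the fine-grained gestures a surgeon uses, such as grasping, cutting, and suturing. Machine learning, and deep learning in particular, has made marked progress in gesture recognition for robotic surgery and can extract discriminative features from multimodal data [21]. Robust recognition still requires large, diverse, annotated datasets [21].
  • Feature space construction: for laparoscopic training tasks, a Maneuver Representation Feature Space (MRFS) built by tracking the vanishing point of the grasper edges reached 96% accuracy in classifying novices versus experts, and more than 98% when the task was known [22].
  • Multimodal fused features: combine data from other sensors, such as force feedback and eye tracking, into a more complete operative feature vector.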

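2.1.3 Surgical Semantic Node Annotation
Semantic node annotation is the key step in building a high-quality supervised dataset. It ties low-level visual features to high-level clinical meaning.

  • Hierarchical annotation: for complex settings such as open surgery, a multi-level scheme can cover the video level (e.g., procedure type), the operation level (e.g., phase and step), and the frame level (e.g., key actions and instrument interactions). The OpenSurgery dataset, for example, contains 843 open-surgery videos covering more than 20 procedure types, annotated by expert surgeons at the video, operation, and frame levels to ensure quality and clinical applicability [23].
  • Standardized terminology and ontologies: annotate with unified medical terminology and ontologies, such as the International Classification of Diseases (ICD) or SNOMED CT, to keep semantics consistent and interoperable. Semantic annotation also helps turn unstructured clinical text into structured, analyzable information; annotating the pain process of cardiac surgery patients, for instance, revealed six aspects: cause, situation, characteristics, consequences, actions, and outcomes [24].
  • Expert involvement: semantic annotation of surgical video usually requires experienced surgeons to ensure accuracy and clinical relevance. This makes annotation expensive, but it is essential if the model is to learn complex surgical logic.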

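2.1.4 Alignment with Unstructured Clinical Text
Surgical video is usually accompanied by a large amount of unstructured clinical text, such as operative notes, progress notes, and nursing records. Aligning this text with the video gives the model rich context and knowledge.

  • Timestamp alignment: identify the key time points or events mentioned in the operative record and align them with the corresponding video segments.
  • Semantic matching: use natural language processing to identify the surgical steps, complications, and instruments described in the text and match them to the corresponding visual events. Ontologies and rule engines, for example, have been used to identify and classify healthcare-associated infections (such as surgical site infection) from electronic health record data in support of risk assessment [25].
  • Multimodal pretraining: feed frame sequences together with their text descriptions into multimodal pretraining so that the model learns the associations between video and text, which is essential for a unified representation space. For example, a contrastive learning strategy combined with a dynamic time warping (DTW) loss can produce fine-grained temporal alignment of video and text and capture how visual semantics evolve over time [23].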
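
To make the temporal alignment idea concrete, here is a minimal sketch of classical dynamic time warping between a sequence of video-clip embeddings and a sequence of text-step embeddings. It is an illustration only: it computes a hard alignment path with cosine distance, whereas a DTW loss used in training would need a differentiable relaxation. The function names and the toy 2-D embeddings are assumptions, and embeddings are assumed to be non-zero vectors.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def dtw_align(video, text):
    """Return (total cost, monotonic path of (video_idx, text_idx) pairs)."""
    n, m = len(video), len(text)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(video[i - 1], text[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover which clip maps to which text step.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Four clips, two text steps: the first two clips match step 0, the last two step 1.
clips = [[1, 0], [1, 0], [0, 1], [0, 1]]
steps = [[1, 0], [0, 1]]
total, path = dtw_align(clips, steps)
```

The path assigns several consecutive clips to one text step, which is the behaviour wanted when a single sentence in an operative note covers many seconds of video.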

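With this preprocessing pipeline, an SV-LLM can extract high-quality, semantically rich information from raw surgical video, which provides the basis for model training and clinical use.

2.2 Domain-Adaptive Fine-Tuning

General-purpose LLMs are capable across open domains, but applied directly to a specialized medical task such as surgical video analysis they tend to lack domain knowledge, misread technical terms, and underperform on specific tasks. Domain-adaptive fine-tuning of a general model is therefore key to accurate recognition of surgery-specific semantics. Typical strategies are incremental pretraining, embedding enhancement, and few-shot learning.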

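2.2.1 Incremental Pre-training

Incremental pretraining continues the pretraining of a general model on large amounts of unlabeled in-domain data so that the model learns the language patterns and knowledge of the domain. For an SV-LLM, the domain data usually includes:

  • Large medical text corpora: textbooks, journal articles, clinical guidelines, electronic health records, operative notes, imaging reports, and similar sources, from which the model learns medical terminology, disease descriptions, diagnostic criteria, and treatment plans. In one study, a BERT-based model was further pretrained on 662,579 unlabeled imaging reports from cancer patients to improve natural language processing (NLP) for cancer outcome extraction [26]. In clinical medicine, an LLM framework that retrieves from curated medical resources showed marked improvement on guideline and treatment-recommendation tasks [27].
  • Surgical manuals and narration: detailed steps, instrument instructions, and anatomical descriptions that help the model learn surgical workflow and semantics.
  • Multimodal medical data: beyond plain text, text descriptions of medical images and videos can be added to incremental pretraining to strengthen multimodal understanding.

The aim is to combine general knowledge with specialist medical knowledge so that the model draws on more precise medical background when interpreting surgical video, which improves its reasoning about and description of complex surgical scenes.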

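2.2.2 Vector Embedding Enhancement

An embedding is a numeric vector that represents a word, phrase, or higher-level semantic unit. In medicine, enriching these embeddings improves the model's grasp of specialist concepts:

  • Medical ontology and dictionary embeddings: map medical concepts into a high-dimensional vector space using ontologies (such as SNOMED CT and ICD-10) and specialist dictionaries. These pretrained concept embeddings can be supplied as part of the model input or used to initialize the model's embedding layer. Ontologies and rule engines, for example, have been used to identify and classify healthcare-associated infections, and such semantic annotation can support risk assessment.
  • Knowledge graph enhancement: encode the entities and relations of a medical knowledge graph as vectors and fuse them with video features and text embeddings through attention or graph neural networks. This helps the model represent relationships between medical concepts, such as disease and symptom, drug and side effect, or surgical step and risk. One study found that screening the output of medical LLMs against a biomedical knowledge graph caught 91.9% of harmful content, which offers a distinct way to validate model output [28].
  • Multimodal alignment embeddings: use techniques such as contrastive learning to map the visual features of surgical video and the corresponding text descriptions into a shared embedding space in which similar visual content and text lie close together. This markedly improves semantic understanding and cross-modal retrieval. In the Referring Surgical Video Instrument Segmentation (RSVIS) task, for example, the Video-Instrument Synergistic Network (VIS-Net) with a Graph-based Relation-aware Module (GRM) models the association between text descriptions and video frames to extract and segment instrument-level information, and it significantly outperforms prior methods [29].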
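
A shared embedding space of this kind is commonly trained with a symmetric contrastive (InfoNCE-style) objective. The sketch below shows that loss on plain Python lists so the arithmetic is visible; it is not the training objective of any cited system. The function names, the temperature value of 0.07, and the toy embeddings are assumptions, and every embedding is assumed to be a non-zero vector.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _log_softmax_at(row, idx):
    """log(softmax(row)[idx]), computed stably by subtracting the row maximum."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - log_sum


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Mean of video-to-text and text-to-video cross-entropy; pair i is the positive."""
    v = [_normalize(e) for e in video_embs]
    t = [_normalize(e) for e in text_embs]
    n = len(v)
    sims = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
             for j in range(n)] for i in range(n)]
    v2t = -sum(_log_softmax_at(sims[i], i) for i in range(n)) / n
    t2v = -sum(_log_softmax_at([sims[i][j] for i in range(n)], j)
               for j in range(n)) / n
    return 0.5 * (v2t + t2v)


# Correctly paired clips and captions give a much lower loss than swapped pairs.
aligned = symmetric_info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = symmetric_info_nce([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

Minimizing this loss pulls each clip toward its own description and pushes it away from the other descriptions in the batch, which is what makes cross-modal retrieval work afterwards.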

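2.2.3 Few-Shot Learning

High-quality annotated surgical video is expensive to obtain and demands considerable effort from specialist surgeons, so in practice models often have to learn from few labeled samples. Few-shot learning strategies are especially relevant here:

  • Transfer learning: start from a model pretrained on a large general dataset (such as ImageNet or a large text corpus) and fine-tune it for the surgical task. This reuses the general feature representations learned in pretraining and can perform well even on small datasets. For instrument recognition in laparoscopic video, fine-tuning a pretrained model was faster and more stable than training from scratch [30]. Deep transfer learning with pretrained models and fine-tuning has also reached state-of-the-art performance in endoscopic artifact detection [31], and in robot-assisted surgery assessment, pretrained models substantially reduced the clinical data required while improving accuracy [32].
  • Meta-learning: train the model to "learn to learn" so that it adapts quickly to new tasks with few samples. Training across many related tasks yields a general learning strategy that reaches good performance on a new task from a handful of examples.
  • Data augmentation: apply rotation, cropping, flipping, color transformation, and similar operations to the limited labeled data to generate more training samples, enlarge the dataset, and reduce overfitting. Data augmentation combined with transfer learning has been used to improve colonoscopy polyp detection [33].
  • Prompt learning: for LLMs, a well-designed prompt template recasts a few-shot task as a contextualized task description the model can follow. For example, a few relevant input-output examples in the prompt can guide the model to complete a similar task without task-specific training. In zero-shot and few-shot settings, this approach draws on the LLM's existing capabilities for knowledge-based reasoning [26].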
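
As a small illustration of prompt learning, the sketch below assembles a few-shot prompt from input-output examples. The template wording, the function name, and the example observations and step labels are all hypothetical; a real deployment would use whatever format the chosen model was tuned on.

```python
def build_few_shot_prompt(examples, query, instruction):
    """Assemble instruction, worked examples, and the unanswered query into one prompt."""
    lines = [instruction, ""]
    for observation, step in examples:
        lines.append(f"Observation: {observation}")
        lines.append(f"Step: {step}")
        lines.append("")
    lines.append(f"Observation: {query}")
    lines.append("Step:")
    return "\n".join(lines)


# Illustrative examples only; the labels are not taken from any cited dataset.
examples = [
    ("Grasper retracts the gallbladder; hook dissects tissue near the cystic duct.",
     "Calot triangle dissection"),
    ("Clip applier closes on the cystic duct; scissors approach.",
     "Clipping and cutting"),
]
prompt = build_few_shot_prompt(
    examples,
    query="Hook separates the gallbladder from the liver bed.",
    instruction="Name the surgical step shown in each observation.",
)
```

Ending the prompt on an open "Step:" line leaves the model a single, well-defined slot to fill, which keeps its answer in the same format as the examples.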

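With these fine-tuning techniques, an SV-LLM can overcome the limits of general models in medicine and interpret the visual and semantic content of an operation more accurately and in more depth, which makes its clinical support more reliable.

2.3 Model Performance Validation

Reliable clinical use of SV-LLMs depends on rigorous validation. This calls for multidimensional, fine-grained metrics and methods that confirm the model is not only technically strong but also safe, accurate, and useful in practice. The main dimensions are surgical step recognition accuracy, operative semantic understanding, and the soundness of clinical decision recommendations.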

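2.3.1 Surgical Step Recognition Accuracy

Step recognition is the foundation of an SV-LLM's understanding of an operation, and its accuracy directly affects downstream semantic analysis and decision support. The main metrics are:

  • Accuracy: the most direct metric, the proportion of surgical steps the model identifies correctly out of all steps.
  • Recall and precision: recall measures how many of the true steps the model finds; precision measures how many of the steps the model reports are real. Clinical use requires a trade-off between the two. In risk warning, for example, high recall (no missed events) may matter more than high precision.
  • F1 score: the harmonic mean of precision and recall, a single summary of performance.
  • Intersection over Union (IoU): for step recognition over time, IoU measures how much the predicted time span of a step overlaps the annotated span. In workflow analysis, identifying the order and duration of surgical tasks is central; one study of AI in robot-assisted surgery reported 75.7% accuracy in identifying the next surgical task [34].
  • Confusion matrix: shows in detail which steps the model confuses with which, and so exposes weaknesses on particular steps.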
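
The metrics above follow directly from their definitions. The sketch below computes per-step precision, recall, and F1 from frame-level labels, and temporal IoU for one predicted segment; the function names and the toy labels are assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one step label, from frame-level labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1


def temporal_iou(pred, truth):
    """IoU of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


y_true = ["dissect", "dissect", "clip", "clip", "clip"]
y_pred = ["dissect", "clip", "clip", "clip", "dissect"]
p, r, f1 = precision_recall_f1(y_true, y_pred, positive="clip")  # each is 2/3
iou = temporal_iou((10.0, 30.0), (20.0, 40.0))  # 10 s overlap over a 30 s union
```

For a risk-warning label, the same function can be evaluated at several decision thresholds to choose an operating point that favours recall.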

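Validation methods typically include:

  • Expert-annotated datasets: several experienced surgeons annotate the steps in surgical videos in detail to form a "gold standard" dataset.
  • Cross-validation: split the data into training, validation, and test sets and use methods such as K-fold cross-validation to assess generalization.
  • Latency-sensitivity evaluation: because steps must be recognized in real time, measure recognition performance at different latencies.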
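
One practical detail in K-fold cross-validation on video is how the folds are formed. The sketch below splits at the video level, so that frames from one operation never appear in both the training and test sets. That grouping is an assumption added here as a common safeguard against leakage; the text above does not prescribe it, and the function name is hypothetical.

```python
def video_level_kfold(frame_video_ids, k):
    """Yield (train_indices, test_indices) with whole videos held out per fold."""
    videos = sorted(set(frame_video_ids))
    if k < 2 or k > len(videos):
        raise ValueError("k must be between 2 and the number of videos")
    for fold in range(k):
        test_videos = set(videos[fold::k])  # every k-th video goes to this fold
        train_idx = [i for i, v in enumerate(frame_video_ids) if v not in test_videos]
        test_idx = [i for i, v in enumerate(frame_video_ids) if v in test_videos]
        yield train_idx, test_idx


# Six frames from three videos, three folds: each fold holds out one video.
ids = ["v1", "v1", "v2", "v2", "v3", "v3"]
folds = list(video_level_kfold(ids, k=3))
```

Splitting by frame instead would put near-identical neighbouring frames on both sides of the split and inflate the measured accuracy.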

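2.3.2 Operative Semantic Understanding

Operative semantic understanding is the core capability of an SV-LLM. The model must recognize not only what is being done but also why, and how well. This covers instrument use, anatomical recognition, and adherence to operative standards. Metrics include:

  • Semantic accuracy: how closely the model's text descriptions or answers match expert reference answers in meaning. NLP metrics such as BLEU and ROUGE can be used, but clinical expert judgment matters more.
  • Named entity recognition and relation extraction: whether the model correctly identifies the medical entities in a surgical video (specific instruments, anatomical sites, disease names) and the relations among them.
  • Event detection and description: whether the model detects key intraoperative events (such as bleeding or signs of a complication) and describes them in detail.
  • Visual question answering (VQA) accuracy: in surgical VQA, the model answers surgery-related questions from the video. One study proposed the LMT++ framework, which uses a multimodal LLM and an adaptive weight assignment strategy and surpassed prior work on domain shift and data imbalance in surgical VQA [35].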

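Validation methods include:

  • Human evaluation: clinical experts review the model's reports, summaries, or answers for clinical soundness, completeness, and accuracy.
  • Comparison with inter-expert agreement: compare the model's output against the level of agreement among different experts' annotations, to measure how closely the model matches human experts.
  • Adversarial evaluation: design challenging questions or scenarios that test semantic understanding in complex or ambiguous cases.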

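2.3.3 Soundness of Clinical Decision Recommendations

The ultimate purpose of an SV-LLM is to support clinical decisions, so its recommendations must be sound, safe, and ethically acceptable. This dimension emphasizes clinical utility and safety.

  • Decision-support accuracy: agreement between the model's video-based diagnoses, risk assessments, or intervention recommendations and actual clinical outcomes. One AI model for predicting lymph node metastasis (LNM) risk showed 97.8% sensitivity and 15.6% specificity, though the false-negative rate still needs careful consideration [36]. In gynecologic oncology, AI has shown promise in risk stratification, diagnosis, and treatment prediction [37].
  • Interpretability: the model should give not only a decision but also its basis and reasoning, which is essential for clinicians to trust and understand the output. Uncertainty quantification (UQ) plays a key role in clinical decision-making because it improves the precision and reliability of medical assessment and helps manage uncertainty in clinical data, diagnostic tools, and treatment outcomes [38].
  • Safety: the potential risks of the model's recommendations, such as misdiagnosis, operative error, or delayed treatment.
  • Efficiency: whether the model meaningfully reduces the time clinicians spend looking up information and reviewing video. AI has, for example, markedly improved the efficiency of diagnosing Helicobacter pylori infection in endoscopy [39].
  • Human-AI collaboration: how often clinicians adopt the model's recommendations in their final decisions, and how satisfied they are with the model.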

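Validation here follows the most demanding standards:

  • Prospective clinical trials: integrate the model into the real clinical workflow and validate it prospectively on real patient data, observing its effect on patient outcomes, efficiency, and safety indicators.
  • Expert consensus evaluation: multidisciplinary experts rate the model's recommendations blind and reach a consensus assessment.
  • Ethics review and compliance assessment: confirm that use of the model meets medical ethics standards and applicable regulations [40].
  • Long-term follow-up: assess the long-term effect of model-assisted decisions, including patient prognosis and complication rates.

Only with this multidimensional, rigorous validation can SV-LLMs move step by step from the laboratory to the clinic and contribute to intelligent surgical care.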

3. Recent Research Progress in Training Surgical Video Large Language Models

3.1 General Technical Advances

近年来,手术视频大语言模型在通用技术层面取得了显著突破,尤其是在端到端手术视频-文本生成模型和实时手术语义解析模型方面。这些进展不仅提升了模型的性能,也为未来临床应用奠定了基础。

3.1.1 End-to-End Surgical Video-to-Text Generation Models

端到端手术视频-文本生成模型旨在直接从原始手术视频中提取信息,并生成与之对应的自然语言描述或摘要。这一领域的最新进展主要体现在以下几个方面:

  • 扩散模型(Diffusion Models)的应用: 扩散模型在图像和视频生成领域展现出强大的能力。研究者们开始将其应用于手术视频生成,以创建更真实、多样且具有良好时间连贯性的手术视频。例如,SurGen 模型就是一种文本引导的扩散模型,专门用于手术视频合成。它在现有手术视频生成模型中实现了最高的图像分辨率和最长的视频持续时间。通过在手术数据上训练的深度学习分类器,SurGen能够验证生成视频的视觉和时间质量,并评估其与相应文本提示的对齐程度。SurGen的成功证明了扩散模型在改善手术教育方面的巨大潜力,能够提供更真实、多样和互动的模拟环境 41。

    类似地,也有研究利用扩散模型交互式地生成腹腔镜视频,进一步探索了该技术在医疗模拟和训练中的应用潜力 42。此外,Ophora 作为一个大规模数据驱动的文本引导眼科手术视频生成模型,也展示了扩散模型在特定专科手术视频生成方面的能力 43。这些生成模型可以根据文本提示生成手术视频,这对于外科培训、手术规划和医学研究具有重要意义 44。

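  • The rise of multimodal large language models (MLLMs): MLLMs process and integrate several modalities, including text, image, video, and audio, which lets them capture the complex information in surgical video and produce more complete text descriptions. One study proposed a comprehensive framework for applying MLLMs in medicine that handles medical images (such as MRI and CT scans), time-series data, audio recordings, text, and video (such as surgical procedures), and discussed applications, challenges, and outlook [13]. This suggests that deep fusion of video features with text is key to high-quality surgical video-to-text generation.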

  • 长上下文理解与推理: 手术视频通常时长较长,包含大量连续性的操作。传统的模型往往难以处理如此长的序列信息。目前,研究正致力于开发能够处理长时间视频和音频上下文的模型,并在其基础上进行精细信息的召回和推理,以更好地理解整个手术过程的逻辑和上下文依赖,从而生成更准确和连贯的手术描述。

3.1.2 Real-Time Surgical Semantic Parsing Models

实时手术语义解析旨在在手术进行过程中,即时识别手术步骤、器械使用、解剖结构以及潜在的风险事件,并将其转化为可理解的语义信息。这对于术中辅助决策和质量控制至关重要。

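  • Widespread use of Transformer architectures: the Vision Transformer (ViT) and its variants have become the mainstream architecture for surgical video analysis. EndoViT, for example, was pretrained on a large collection of endoscopic images (the Endo700k dataset, with more than 700,000 images), which markedly improved endoscopic video analysis. EndoViT outperformed ImageNet-pretrained models on action triplet recognition and reached state-of-the-art results in semantic segmentation, which demonstrates the value of large-scale, domain-specific self-supervised pretraining [45]. Such pretraining captures the visual characteristics of endoscopic images better and gives real-time semantic parsing a stronger base.

  • Multi-task learning and joint optimization: to parse semantics comprehensively in real time, researchers train one model on several tasks at once, such as step recognition, instrument segmentation, and event detection. Joint optimization exploits the correlations among tasks and improves overall performance. Related work in robot-assisted surgery identified the next surgical task with 75.7% accuracy, which points toward real-time decision support.

  • Lightweight, efficient inference: real-time use demands computational efficiency, so one line of research designs lightweight, low-latency architectures for fast inference in the operating room, using techniques such as pruning, quantization, and hardware acceleration.

  • Robustness to uncertainty and anomalies: real operations are complex and variable, and the model must tolerate noise, artifacts, and rare events. This has led researchers to uncertainty quantification (UQ), in which the model reports a confidence estimate alongside its semantic output and so gives the surgeon more dependable assistance.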
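
One simple form of uncertainty quantification is to score each prediction by the entropy of its class probabilities and to abstain when the entropy is high. The sketch below shows that rule; it is one of several UQ approaches rather than the method of any cited work, and the function names, the step labels, and the 0.5 threshold are assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1]; 0 is certain, 1 is uniform. Needs >= 2 classes."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def predict_or_abstain(logits, labels, max_entropy=0.5):
    """Return (label, uncertainty), or (None, uncertainty) when too uncertain."""
    probs = softmax(logits)
    u = normalized_entropy(probs)
    if u > max_entropy:
        return None, u
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], u


labels = ["dissection", "clipping", "extraction"]
confident = predict_or_abstain([10.0, 0.0, 0.0], labels)  # clear winner
unsure = predict_or_abstain([0.0, 0.0, 0.0], labels)      # uniform, abstains
```

An abstention can be surfaced to the surgeon as "no confident prediction" rather than as a possibly wrong label, which is the safer failure mode in the operating room.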

这些通用技术突破为手术视频大语言模型从实验室走向临床应用奠定了坚实的基础,也预示着未来手术智能化辅助系统将拥有更强的感知、理解和推理能力。

3.2 Specialty-Specific Adaptation

随着通用手术视频大语言模型技术的不断成熟,其在不同外科专科的适配性研究也取得了显著进展。各专科根据自身手术特点、数据可得性和临床需求,开发并测试了定制化的模型,以期更好地解决专科特有的挑战。

3.2.1 General Surgery

普外科手术种类繁多,包括胆囊切除术、阑尾切除术、胃肠道手术等。这些手术通常涉及复杂的解剖结构和精细的操作。普外科手术视频大模型的研发侧重于:

  • 手术步骤识别与工作流分析: 针对腹腔镜胆囊切除术等常见术式,模型能够实现手术阶段和步骤的自动化识别,精度较高。例如,在自动生成手术报告、实时进度跟踪和标准化培训评估方面展现出潜力。有研究表明,深度学习模型在外科视频中解剖结构分割和目标检测方面取得了显著进展,特别是在普通外科手术(占36.1%)和结直肠外科手术(占14.7%)中,胆囊切除术(26.2%)和低位直肠前切除术(8.2%)是研究最多的手术类型 46。
  • 器械交互与事件检测: 模型能够识别手术器械的类型、使用方式及其与组织结构的交互,并检测潜在的并发症事件,如出血、组织损伤等。
  • 并发症预测: 基于术中视频特征,结合患者术前数据,预测术后并发症风险,如术后胰瘘、吻合口瘘等。

挑战在于普外科手术差异性大,需要模型具备强大的泛化能力和对罕见事件的识别能力。高质量、大规模的标注数据仍然是制约模型进一步发展的瓶颈。

3.2.2 Orthopaedics

骨科手术,尤其是关节镜手术,对器械操作的精准性和软组织保护有极高要求。骨科手术视频大模型的应用主要体现在:

  • 损伤识别与评估: 关节镜手术中,模型能够辅助识别软骨损伤、韧带撕裂等病变,并对其严重程度进行评估。例如,在关节镜髋关节和膝关节手术视频中,医源性软骨损伤的发生率高达73.8% 47。模型可以帮助医生实时检测这些损伤,减少人为疏忽。
  • 操作规范性评估: 监测外科医生在关节镜下的操作是否符合标准,如避免对关节软骨的医源性损伤。有研究指出,即使是轻微的医源性损伤(如1.5N的接触力)也会导致软骨细胞死亡 47。
  • 术后康复指导: 通过分析术后康复视频,为患者提供个性化的康复方案和动作纠正建议。

骨科的挑战在于骨骼和软组织结构的复杂三维形态,以及手术视野中常出现的模糊和遮挡。此外,特定小关节的手术视频数据相对稀缺。

3.2.3 Neurosurgery

神经外科手术以其高风险、高精度著称,对术中导航和精细操作有着极致要求。手术视频大模型在神经外科领域的应用前景广阔:

  • 关键结构识别与保护: 模型可以实时识别并标记神经、血管等关键结构,帮助外科医生避免损伤。
  • 病变定位与切除辅助: 结合术前影像数据,在术中提供病变区域的精确导航,辅助实现病变的完整切除。例如,Sora这类文本到视频生成AI在神经外科中具有潜在应用,包括患者教育、公众健康、手术培训和规划、以及研究传播等 44。虽然目前生成视频仍存在物理上不合理运动、物体变形等局限性,但未来有望在术前规划中发挥作用。
  • 机器人辅助神经外科: 随着机器人辅助手术在神经外科中的应用,模型可以分析机器人操作视频,评估操作精度和效率,并用于训练机器人进行更精细的操作。

神经外科的挑战在于对实时性、准确性和鲁棒性的极高要求,任何微小误差都可能导致严重后果。此外,神经组织的个体差异大,且手术视野常被血液和脑脊液遮挡,增加了模型识别的难度。

3.2.4 Obstetrics and Gynecology

妇产科手术视频大模型在产前诊断、微创手术辅助等方面具有独特优势:

  • 胎儿超声图像分析: 在产前超声检查中,深度学习模型已被广泛应用于胎儿异常识别、生物测量和生长曲线生成,减轻了医生工作负担并提高了诊断效率 48。Transformer-based神经网络模型在卵巢癌超声检测中表现出强大的泛化能力和超越专家水平的诊断准确性,有可能缓解超声专家短缺的问题,并改善患者预后 49。
  • 微创妇科手术辅助: 模型可用于腹腔镜或宫腔镜手术,辅助识别子宫内膜异位、肌瘤、卵巢囊肿等病变,并指导手术切除。
  • 手术并发症预警: 实时监测手术过程中的出血、组织损伤等情况,及时预警。

妇产科的挑战在于胎儿超声图像的高度可变性,以及微创手术中视野狭窄、操作空间有限等问题。此外,对女性隐私数据的保护也是需要重点考虑的伦理问题。

总体而言,各专科都在积极探索手术视频大模型在各自领域的应用潜力。虽然面临数据标注成本高、模型泛化能力、实时性、可解释性以及伦理法规等共同挑战,但通过定制化模型开发、领域知识融入和持续的临床验证,这些模型有望在未来显著提升各专科的诊疗水平和效率。

4. Current Clinical Applications of Surgical Video Large Language Models

手术视频大语言模型(SV-LLMs)凭借其强大的视频理解、语义分析和自然语言生成能力,正在逐步渗透到临床手术的各个环节,并在外科技能教学与培训、术中辅助决策支持以及手术质量智能化管控等领域展现出巨大的应用潜力。

4.1 Surgical Skills Teaching and Training

传统外科培训模式效率低下且资源受限,SV-LLMs的引入正为外科技能教学带来革命性的变革。模型能够深度解析手术视频内容,提供多维度、个性化的培训反馈和辅助教学工具。

  • 自动生成手术操作解说文本: SV-LLMs可以分析手术视频,自动识别手术步骤、器械使用、解剖结构及关键操作,并生成详细、准确的文字解说。这类似于资深外科医生在观看手术视频时进行的旁白讲解。例如,模型可以生成“此阶段正在进行胆囊三角的解剖,注意勿损伤肝总管”、“电凝止血时应避免对周围组织造成热损伤”等文字描述。这些解说文本可以作为培训材料,帮助学员理解手术流程和操作要点,弥补传统教学中录像无解说的不足。
  • 复盘手术失误节点与风险分析: 通过对海量高质量手术视频的学习,SV-LLMs能够识别出偏离标准操作流程、可能导致并发症的关键节点或潜在失误。在培训中,模型可以自动标记出学员手术视频中存在的失误操作,如器械使用不当、组织暴露不足、过度牵拉等,并结合临床指南和专家经验,对这些失误进行风险评估和后果分析。例如,在腹腔镜胆囊切除术中,模型可以识别出未能充分暴露Calot三角的风险,并提醒学员这可能增加胆管损伤的几率。这种自动化的、细致入微的反馈,远超传统人工复盘的效率和广度。
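  • Individualized skill-improvement plans: from a full analysis of a trainee's operative performance, an SV-LLM can assess proficiency, efficiency, and decision-making and generate a targeted improvement plan. For a trainee who is weak at suturing, the model can recommend specific simulation tasks and relevant expert videos to study. For a trainee whose operations run long, it can identify which steps take the most time and suggest changes. This individualized teaching can speed up a surgeon's skill development. Animated videos have also been shown to improve patient knowledge, particularly in surgery and in health and clinical topics such as diabetes, with a mean effect of 0.35 [50]; that evidence concerns patient education, but the same teaching advantages plausibly carry over to surgical training. Live-streamed surgery gives medical students an effective route to remote learning and continuing education, especially in exceptional periods such as a pandemic [51, 52]. Combining SV-LLMs with live streaming could make teaching more intelligent and more interactive.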

4.2 Intraoperative Decision Support

在手术过程中,外科医生需要在高压环境下迅速做出决策。SV-LLMs通过实时分析术中视频,为医生提供关键信息和决策支持,提高手术安全性和效率。

  • 实时识别手术步骤与阶段: 模型可以实时监测手术进程,自动识别当前正在进行的手术步骤和所处阶段。例如,在胆囊切除术中,模型可以提示“当前已进入胆囊管分离阶段”、“即将进行胆囊床剥离”。这种实时状态感知有助于医生把握整体手术节奏,尤其对于年轻医生而言,能起到重要的引导作用。现有研究已能通过AI技术在机器人辅助手术中识别下一个手术任务,准确率可达75.7% ,为实时决策提供了可能。
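  • Flagging operative risk and warning of complications: by combining intraoperative video, patient physiological data, and a clinical knowledge base, an SV-LLM can warn of operative risks and complications in real time. During dissection, for example, the model can alert the surgeon immediately if it detects early signs of vessel injury or a risk of tissue tearing. In laparoscopic cholecystectomy, an AI platform called SurgSmart was developed to assess the critical view of safety (CVS) automatically and was deployed in real time during surgery. Deployment at three hospitals showed a significant rise in overall CVS score (P < 0.01), and most surgeons (15 of 18) improved after using the platform (P < 0.05) [53]. This suggests that AI-based video analysis of this kind can make intraoperative decisions safer.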
  • 提供解剖结构辅助识别与测量: 模型可以实时高亮显示重要的解剖结构,如神经、血管、淋巴结等,并进行尺寸测量或距离计算。这对于复杂解剖区域的手术尤为重要,有助于避免误伤。例如,在肝脏手术中,三维肝脏模型已被用于客观预测结直肠肝转移的治疗建议,并识别出关键的解剖学参数,例如肿瘤与肝脏表面和门静脉之间的距离 54。虽然这主要用于术前规划,但结合SV-LLMs,这些信息可以实时地在术中呈现给外科医生。
  • 指导器械选择与操作建议: 基于当前手术场景和步骤,模型可以智能推荐合适的器械并提供最佳操作路径或技巧建议。例如,在骨科手术中,模型可以分析当前骨折类型和位置,推荐合适的钢板型号和螺钉植入角度。AI在临床实践中的整合扩展到诊断、规划、术中辅助等多个方面,其中包括基于大型语言模型的分类和编码,以及手术视频中的导航和阶段/手势识别 55。

4.3 Intelligent Surgical Quality Control

SV-LLMs不仅能在教学和术中提供帮助,还能在手术后的质量管理方面发挥关键作用,实现手术全流程的智能化质控。

  • 自动生成手术质控报告: 模型可以自动分析整个手术视频,提取关键操作数据(如手术时间、器械使用时长、特定步骤耗时、出血量估计等),并结合手术规范和指南,生成标准化的手术质控报告。这些报告包含详细的手术流程分析、操作合规性评估、潜在风险点回顾等,极大地减轻了人工撰写报告的负担,并提高了报告的客观性和一致性。
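  • Assessing operative adherence and standardization: by comparing an operation against a predefined standard workflow, an SV-LLM can quantify how closely a surgeon follows the standard. In radical gastrectomy for gastric cancer, for example, the model could assess whether lymph node dissection was complete and whether gastrointestinal reconstruction met the standard. This matters for surgeon performance review, continuing education, and the spread of standardized technique. One study used endoscopy to assess the features of eosinophilic esophagitis and validated a new classification and grading system, which shows the role of video-based assessment in standardized diagnosis [56]. A similar approach could be applied to assessing operative standards.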
  • 批量筛选手术不良事件案例: SV-LLMs可以大规模地筛选手术视频,自动识别和标记手术中发生的不良事件,如术中出血、脏器损伤、麻醉意外等。这有助于医院及时发现问题、进行风险评估、分析原因并改进流程,从而提升整体手术质量和患者安全。例如,通过识别异常操作行为或特定并发症的早期迹象,模型可以帮助医院快速定位需要重点关注的手术案例,进行深入调查和学习。
  • 辅助医疗纠纷举证与分析: 在发生医疗纠纷时,手术视频是重要的证据。SV-LLMs可以快速梳理和分析视频内容,准确提取与纠纷焦点相关的操作细节和时间节点,为医疗纠纷的处理提供客观、量化的数据支持,辅助责任认定。

SV-LLMs在这些应用场景中的实践,正逐步将手术从传统的经验驱动模式向数据驱动、智能化辅助模式转变,有望全面提升手术的安全性、效率和教学培训质量。

5. Core Challenges Facing Surgical Video Large Language Models

手术视频大语言模型(SV-LLMs)尽管展现出巨大的潜力,但在实际应用和推广过程中,仍面临一系列严峻的挑战,这些挑战主要集中在数据、技术、以及监管与伦理层面。

5.1 Data Challenges

数据是SV-LLMs训练的基石,然而,手术视频数据的特殊性和复杂性,使得数据层面挑战尤为突出。

  • 手术视频标注成本高昂: 手术视频是高度专业化的数据,其标注过程需要经验丰富的外科医生耗费大量时间和精力。高质量的标注不仅包括对手术步骤、器械使用、解剖结构变化的识别,还包括对异常事件、并发症迹象的精确标记以及临床语义的深度理解。例如,一篇综述指出,视频标注是人工智能在手术领域面临的五大挑战之一,因为其耗时且需高度专业知识 57。这种专业性和耗时性导致了极高的标注成本,使得大规模、多中心、高质量的数据集难以快速积累。此外,不同的医生对同一视频的标注可能存在主观差异,进一步增加了标注的一致性和可靠性挑战。

  • 跨机构数据隐私保护难度大: 手术视频中包含了大量患者的敏感信息,如身体状况、疾病诊断、手术过程细节等,这些都属于受保护的健康信息(PHI)58。在多个医疗机构之间共享此类数据用于模型训练时,数据隐私和安全问题成为核心障碍。不同国家和地区有严格的医疗数据保护法规(如HIPAA、GDPR),这些法规限制了数据的自由流动和共享。联邦学习(Federated Learning, FL)被认为是解决这一问题的重要途径,它允许模型在不直接共享原始数据的情况下,通过交换模型参数或梯度在多个机构间进行协同训练 59。然而,FL技术本身也面临模型收敛性、通信效率等挑战,且与集中式数据训练的模型相比,FL模型可能更容易受到病例量大的机构的影响,因此仍需进一步验证其在真实世界的医疗场景中的实施和有效性 59。

  • 罕见术式样本稀缺: 许多复杂或罕见的手术类型,由于其发病率低或实施机构少,导致相关手术视频样本极为稀缺。例如,一项针对脊髓损伤(SCI)干预研究的系统评价发现,81%的研究样本量小于20例,这凸显了罕见疾病样本稀缺的普遍性 60。这使得模型在这些特定术式上的泛化能力和性能受到严重限制。小样本学习(Few-shot learning)和零样本学习(Zero-shot learning)技术旨在缓解这一问题,通过迁移学习或元学习从常见术式中获取的知识,来处理罕见术式数据。然而,这些方法在保证高精度和可靠性方面仍面临挑战,特别是在医疗领域对错误容忍度极低的情况下。此外,现有视频生成技术,如可控光照不变性GAN,可以合成多样化且时间一致的手术视频,以扩充训练数据并增强模型的泛化能力 61。但合成数据能否完全替代真实数据,以及合成数据本身的真实性和多样性如何评估,仍是需要深入研究的问题。

5.2 Technical Challenges

除了数据层面的限制,手术视频大语言模型(SV-LLMs)在技术实现和性能上也存在诸多挑战,这些问题直接影响了模型在临床实践中的可靠性和可用性。

  • 模型实时推理延迟高: 实时性是术中辅助系统成功的关键。外科医生需要在手术过程中即时获得反馈和建议,任何显著的延迟都可能导致信息滞后,甚至影响手术决策。目前的大语言模型,特别是多模态大语言模型(MLLMs),往往参数量巨大,计算复杂度高,导致其在处理高分辨率、高帧率的手术视频流时,难以达到实时的推理速度。这需要强大的计算硬件支持,并且对模型的架构优化提出了更高要求,例如需要开发更轻量级、高效的模型或者采用边缘计算等部署策略。一项研究指出,为了实现高效的远程医疗指导系统,编码时间至关重要,因此过于深入的架构不适用于手术切口提取,他们提出了一个浅层卷积神经网络(S-CNN),在编码性能上取得了显著提升,证明了轻量化模型的重要性 62。

  • 复杂手术场景泛化能力弱: 尽管模型在特定、标准化的手术场景下可能表现良好,但在面对复杂的、非典型的手术情况时,其泛化能力往往不足。手术中可能出现各种意外情况,如解剖变异、病理复杂性、术中出血、器械故障、以及不同外科医生操作习惯的差异等。这些因素都会使模型难以准确识别和理解当前场景。现有模型可能在面对训练数据中未曾充分覆盖的罕见情况时,表现出性能下降,甚至给出错误判断。例如,有研究强调了即使是“最先进”的深度学习模型,在临床环境中的泛化能力也往往受到数据偏见、模型架构、特定任务等因素的影响,并呼吁在模型开发和部署时进行严格的评估和验证 63。

  • 决策输出可解释性不足: 医疗决策事关患者生命安全,医生对AI模型输出的信任度至关重要。目前的深度学习模型,包括LLMs在内,常被视为“黑箱”,其决策过程难以被人类理解和解释。当模型给出诊断建议、风险预警或操作指导时,如果无法解释其推理依据,医生将很难完全信任并采纳。这不仅是一个技术问题,也是一个信任和伦理问题。可解释人工智能(XAI)旨在使AI系统的决策过程更加透明和可理解,这对于SV-LLMs在临床中的广泛应用至关重要。例如,一项预测肝癌术后肝衰竭的机器学习模型,不仅实现了高预测准确率(AUC 0.983),还通过SHAP分析解释了模型中总胆红素、MELD评分等关键变量的影响,从而增强了模型的可解释性 64。同样,在预测脊柱畸形手术输血需求的研究中,使用SHapley Additive exPlanation (SHAP) 来解释预测模型,量化了年龄、体重指数、术前血细胞比容等变量的危害水平 65。这些研究表明,结合可解释性分析的机器学习模型,在医疗领域能够更好地被临床医生所接受和应用。

5.3 Regulatory and Ethical Challenges

手术视频大语言模型在临床应用中,除了数据和技术挑战,还面临着复杂且多变的监管与伦理挑战。这些挑战不仅关乎法律责任的划分,也涉及患者权益保障、医疗公平性以及对医疗实践固有伦理原则的冲击。

  • 模型临床应用的责任界定: 当SV-LLMs辅助医生做出决策或提供建议时,一旦出现医疗事故或不良后果,责任应如何界定是一个悬而未决的难题。是完全归咎于模型的开发者?还是使用模型的医生?亦或是提供数据的机构?现有法律框架通常针对人类行为或传统医疗器械设计,难以直接适用于AI辅助的医疗场景。例如,大语言模型(LLMs)在医学领域的应用引发了对数据隐私、数据溯源、知识产权污染以及广泛应用和可塑性的伦理担忧 66。在患者受到伤害时,如何明确责任分配仍不清楚 67。这需要建立清晰的法规和法律界限,以正确分配责任并保护用户 67。此外,患者对AI的信任度也需考虑,多数患者仍希望医生进行补充评估以确保可靠性和问责制 68。

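  • Medical compliance of model output: reports, recommendations, and diagnostic aids generated by an SV-LLM must meet strict medical compliance standards. Output must be accurate, complete, unbiased, and consistent with current guidelines and clinical practice. AI models, however, can "hallucinate", producing information that looks plausible but is inaccurate or simply wrong [69]. In medicine such errors can have serious consequences, including misdiagnosis, wrong treatment advice, and harm to patients. Ensuring compliance requires strict validation, auditing, and continuous monitoring. Bias in the training data can also bias the output and widen health disparities; a dataset that lacks images of skin of color, for instance, can lead to misdiagnosis when the model is applied to the general population [68]. The use of AI in medicine must therefore meet the strictest ethical standards [67, 70].

  • Patient privacy and data security: training and using SV-LLMs involves large volumes of sensitive medical data, including personally identifying information, medical history, test results, and surgical video. Privacy and security must be protected during collection, storage, transmission, and use. A breach or misuse violates patient privacy and can lead to litigation and loss of public trust. De-identification lowers the risk, but as AI advances the risk of re-identification remains. Privacy-preserving techniques such as federated learning have been proposed in response [71, 72], yet agreeing on unified, cross-institutional data-sharing agreements and security standards remains a hard problem [73].

  • Algorithmic bias and fairness: the training data for SV-LLMs reflects historical data and clinical practice and may encode bias by sex, race, or socioeconomic status. Historical under-diagnosis or inadequate treatment of certain conditions in particular groups, for example, can be learned and amplified by the model, which biases its diagnoses or treatment advice for those groups and deepens inequity in care [68, 70]. Ensuring fairness and avoiding algorithmic bias is an important part of medical ethics and calls for diverse training data, adversarial debiasing, and rigorous fairness evaluation.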

  • 透明度与可解释性缺失: 前文已提及模型决策过程的“黑箱”特性。在临床实践中,医生需要了解模型做出某个建议的依据,以便判断其合理性并承担最终责任。如果模型无法提供透明的解释,将难以获得医生的信任和临床采纳。缺乏透明度也使得监管机构难以对其进行有效的审计和评估。

  • 对医患关系和人文关怀的影响: 过度依赖AI可能削弱医生的人文关怀能力和与患者建立信任的能力 67。AI辅助系统虽然能提高效率和准确性,但若医生过于依赖AI而减少与患者的直接沟通和情感交流,可能会损害传统的医患关系。如何在提高效率的同时,保持医疗服务的人文温度,是伦理层面需要深思的问题。

  • 知情同意与自主权: 患者是否有权知晓自己的诊疗过程是否使用了AI辅助?如果使用了,他们是否有权选择不使用?AI在医疗领域的应用涉及到患者的知情同意和自主权问题 70。清晰地告知患者AI的使用方式、潜在风险和收益,并获得其明确同意,是伦理实践的必要条件。

这些监管与伦理挑战并非孤立存在,而是相互交织,需要在技术创新、法律法规制定、医学伦理规范和社会观念转变等多层面共同努力,才能确保手术视频大语言模型在医疗领域的负责任和可持续发展。

6. Future Applications and Development Directions for Surgical Video Large Language Models

手术视频大语言模型(SV-LLMs)正处于快速发展阶段,其未来应用前景广阔,将深刻改变外科医疗的模式。然而,要充分释放其潜力,仍需在技术、临床落地和产业延伸等多个维度持续创新和突破。

6.1 Directions for Technical Iteration

SV-LLMs的未来技术迭代将围绕提升模型的性能、效率、隐私保护能力和多模态整合能力展开,主要包括多中心联邦训练框架、轻量化边缘部署模型以及多模态医疗数据联动分析。

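6.1.1 Multi-Center Federated Training Frameworks

In response to strict privacy rules and data silos in healthcare, federated learning (FL) has become a key direction for SV-LLM development. FL lets multiple institutions jointly train a shared machine learning model without directly sharing raw, sensitive patient data [74].

  • Privacy and compliance: FL keeps data local and exchanges only model parameters or gradients, which avoids the leakage risks of transferring and centralizing data and helps meet strict medical data protection rules (such as HIPAA and GDPR) [75]. In a clinical study of intracranial hemorrhage detection, a federated network of five neurosurgical departments trained a convolutional neural network without sharing data; the model detected hemorrhage well and generalized better [76].
  • Breaking data silos and improving generalization: data from different hospitals can follow different distributions (different equipment, surgeon habits, and patient populations). FL draws on heterogeneous data from many sources to train a more robust model that generalizes better, depends less on any one distribution, and performs better in unseen settings.
  • Communication and compute efficiency: FL avoids moving data, but frequent exchange of model parameters still incurs communication cost. Future work will pursue more efficient protocols, parameter compression, and asynchronous updates to cut that cost and to cope with differences in network bandwidth between institutions.
  • Personalized models: combining meta-learning with personalized federated learning lets each participant fine-tune its local model on top of the shared global model according to its own data distribution and clinical needs, giving SV-LLMs that are both general and personalized.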
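
The parameter exchange at the heart of federated learning can be illustrated with federated averaging (FedAvg), in which a server combines client parameters weighted by each client's sample count. The sketch below shows only that aggregation step on flat parameter lists; the function name and the numbers are illustrative assumptions, and a real system would add local training rounds, secure transport, and possibly differential privacy.

```python
def fed_avg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (one FedAvg round)."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hospitals: the second contributes three times as many cases,
# so the global model sits three times closer to its parameters.
hospital_a = [1.0, 2.0]
hospital_b = [3.0, 4.0]
global_weights = fed_avg([hospital_a, hospital_b], client_sizes=[1, 3])
```

The example also shows why a high-volume institution can dominate the global model, which is one motivation for the personalized variants mentioned above.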

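6.1.2 Lightweight Models for Edge Deployment

Real-time uses of SV-LLMs, such as intraoperative assistance, place very high demands on inference speed and resource use. Deploying the model on edge devices (such as compute units in the operating room or the surgical robot itself) is key to low latency and high efficiency.

  • Model compression and optimization: pruning, quantization, and knowledge distillation shrink model size and computational cost while preserving as much performance as possible. One study used a lightweight multi-frequency network (MFF-Net) to measure heart rate from facial video at lower computational cost and with better performance, which illustrates the potential of lightweight architectures in medical deployment [77].
  • Hardware acceleration: adapt the model architecture and inference framework to the edge hardware used in operating rooms (GPUs, FPGAs, dedicated AI chips) to make the most of the hardware.
  • Low power and real-time operation: an edge model must meet real-time requirements within a power budget, which matters most for long-running or battery-powered devices.
  • Model updates and maintenance: edge-deployed models need an effective mechanism for remote updates and maintenance so that they stay current and vulnerabilities are patched promptly.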
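
Of the compression techniques listed, quantization is the easiest to show in miniature. The sketch below applies symmetric per-tensor int8 quantization to a list of weights and then reconstructs them; it illustrates the arithmetic only, and the function names and sample weights are assumptions. Production toolchains add calibration data, per-channel scales, and quantized kernels.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats to integers in [-127, 127] plus a scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Approximate reconstruction of the original float weights."""
    return [v * scale for v in q]


weights = [-1.0, 0.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now fits in one byte instead of four, and the reconstruction error is bounded by half the scale, which is the trade-off that makes int8 inference attractive on edge hardware.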

6.1.3 Linked Analysis of Multimodal Medical Data

手术视频并非孤立存在,而是整个患者诊疗过程中多模态医疗数据链中的一环。SV-LLMs的未来发展将更强调与电子病历、医学影像(CT、MRI等)、生理信号、病理报告甚至基因组数据等多模态信息的深度融合与联动分析。

  • 统一多模态表示学习: 开发能够将不同模态数据映射到统一特征空间的模型,实现跨模态信息的无缝交互和融合。例如,通过共同嵌入空间对齐视频与文本 78。
  • 跨模态知识推理: 利用LLMs强大的推理能力,从结构化和非结构化的多模态数据中挖掘深层次的医学知识和关联。例如,结合影像数据和临床笔记进行综合分析,为手术规划提供更全面的信息 79。
  • 事件预测与风险评估: 通过整合患者术前多模态数据、术中视频分析和术后随访结果,构建更精准的事件预测模型(如术后并发症、预后评估),实现全流程的风险管理。
  • 可解释性与因果分析: 结合多模态数据,增强模型的因果推理能力和可解释性,帮助医生理解模型决策的深层原因,从而提升对AI辅助决策的信任度。例如,通过结合多模态数据进行主动代理协作推理,可以在零样本下实现优于全监督方法的医学推理,并且更具可解释性 80。

这些技术迭代方向将共同推动手术视频大语言模型从单一任务处理走向综合性、智能化辅助,最终实现与整个医疗信息生态系统的深度融合。

6.2 Pathways to Clinical Deployment

手术视频大语言模型(SV-LLMs)的最终价值体现在其在临床实践中的有效落地与应用。这不仅需要技术层面的突破,更需要与现有临床工具和工作流程的无缝融合。SV-LLMs与手术导航系统、电子病历系统以及手术机器人等平台的结合,将是其实现临床价值的关键路径。

6.2.1 Integration with Surgical Navigation Systems

手术导航系统通过术前影像(如CT、MRI)重建三维模型,并在术中实时显示器械位置与解剖结构关系,以辅助医生精准操作。SV-LLMs的引入将进一步提升导航系统的智能化水平:

  • 实时解剖结构智能标注与识别: SV-LLMs可以实时分析手术视频流,自动识别和标注关键解剖结构(如神经、血管、肿瘤边界),并将其叠加到导航系统显示的术前三维模型上。这可以弥补传统导航系统在软组织变形、视野受限等情况下的不足。例如,在神经外科手术中,模型可以实时高亮显示视神经、颈内动脉等重要结构,并通过增强现实技术在医生视野中进行提示。
  • 术中风险区域动态提示: 结合SV-LLMs对视频内容的实时理解和对临床知识的掌握,导航系统能够动态提示潜在的风险区域,如肿瘤浸润区、炎症区域或易损伤的血管。当器械接近这些高风险区域时,系统可以发出视觉或听觉警告,提醒外科医生谨慎操作,降低并发症风险。
  • 操作路径智能规划与修正: 在高风险或复杂手术中,SV-LLMs可以基于实时视频分析,结合导航系统的目标区域信息,为外科医生提供优化的操作路径建议,甚至在操作过程中根据实际情况动态调整。这对于提升手术效率和安全性具有重要意义。

6.2.2 Deep Integration with Electronic Health Record (EHR) Systems

电子病历系统是临床信息的核心载体,SV-LLMs与EHR的整合能够实现信息的双向流通和深度利用,从而优化整个诊疗流程:

  • 手术记录自动化生成与归档: SV-LLMs能够自动分析手术视频,抽取关键事件(如切开、缝合、器械使用、出血情况等),并根据预设模板自动生成结构化的手术记录草稿。医生只需进行审核和少量修改,即可完成高质量的手术记录。这不仅大大减轻了医生的文书工作负担,也提高了记录的标准化和准确性。例如,有研究指出,人工智能在病理诊断中可自动生成高质量报告,从而减轻病理医生的工作量并提高效率。
  • 术后随访与康复指导个性化: 通过分析手术视频,模型可以为每位患者生成个性化的术后康复方案,并将其自动录入EHR。在随访过程中,EHR可以结合SV-LLMs对手术视频的理解,追踪患者康复进展,并根据需要调整康复计划。
  • 临床决策支持知识库扩充: SV-LLMs从海量手术视频中学习到的经验和模式,可以作为宝贵的临床知识,集成到EHR的决策支持模块中。当医生在EHR中录入患者信息时,系统可以调用SV-LLMs的知识,提供针对性的手术风险评估、并发症预测和治疗方案建议。例如,AI在预测手术部位感染(SSI)方面表现出色,其准确性甚至超越了临床医生的评估,这为EHR中的风险评估工具提供了强大支持 81。
  • 医疗大数据分析与研究: 整合了SV-LLMs生成的手术语义信息和EHR中的其他临床数据,可以构建更全面、更丰富的大型医疗数据库。这为开展大规模的临床研究、疾病机制探索和治疗方案优化提供了前所未有的数据基础。

6.2.3 Collaboration with Surgical Robots

手术机器人通过其高精度、高稳定性,已成为现代外科的重要辅助工具。SV-LLMs与手术机器人的协同工作,将推动机器人辅助手术向更高层次的自主化和智能化发展:

  • 机器人操作的语义理解与评估: SV-LLMs可以实时分析手术机器人执行操作的视频,理解机器人的动作意图、操作质量和潜在风险。例如,评估机器人是否按照预设路径准确移动、是否对组织施加了适当的力、是否存在碰撞风险等。这对于机器人的性能优化、故障诊断和新术式学习至关重要。
  • 人机交互与智能指令: 通过自然语言交互,外科医生可以直接向SV-LLMs发出高层次的手术指令,例如“将胆囊牵拉至上方”、“准备缝合血管”。SV-LLMs将这些指令转化为机器人可执行的精细动作序列,从而实现更流畅、更智能的人机协作。
  • 机器人自主学习与适应: SV-LLMs可以从大量外科医生使用机器人的视频中学习最优操作策略,并通过强化学习等技术,使机器人具备更强的自主学习和适应能力。例如,当遇到复杂的解剖变异时,机器人可以基于SV-LLMs的实时理解,自主调整操作方案。
  • 提升机器人手术培训效果: SV-LLMs可以分析住院医生在机器人模拟器上的操作视频,提供详细的性能评估和个性化反馈,从而加速机器人手术技能的培训过程。

综上所述,手术视频大语言模型在临床中的落地,并非独立运行,而是通过与其他先进医疗系统的紧密集成,共同构建一个更智能、更高效、更安全的未来手术生态系统。

6.3 Scope for Industry Extension

手术视频大语言模型(SV-LLMs)的成熟和临床应用将不仅仅局限于医院内部的直接辅助,更将催生和拓展出一系列围绕外科医疗的产业延伸空间,重塑外科医疗产业链的多个环节。这主要体现在远程手术指导、外科培训智能化体系建设和外科器械研发辅助等领域。

6.3.1 Remote Surgical Guidance

远程手术指导(Telementoring)是SV-LLMs最具变革潜力的应用之一。它通过实时视频流和AI智能分析,将资深专家的知识和经验扩展到地理受限或资源匮乏的地区,从而实现优质医疗资源的普惠化。

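  • Cross-region guidance and collaboration: an SV-LLM can analyze remote surgical video in real time and give the operating surgeon intelligent assistance. When a complex operation is performed in a developing country or a remote area, for example, a senior expert can monitor progress remotely through the SV-LLM, receive the model's step recognition, risk warnings, and suggestions, and relay feedback to the operator by text or voice in real time. This overcomes geographic barriers and can improve safety and success rates. Telemedicine, and telesurgery in particular, has been shown to transmit medical information and deliver assistance effectively [82, 83]. With SV-LLMs, such guidance could become more precise and more intelligent.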
  • 应急和战地医疗支援: 在突发灾难、战地或其他应急环境下,医疗资源极度紧张。SV-LLMs可用于为现场医生提供快速、准确的手术指导,尤其是在外科专家无法及时到达现场的情况下,这将是挽救生命的关键技术。
  • 5G技术赋能远程医疗: 5G通信技术提供的低延迟、高带宽特性,是实现高质量远程手术指导的基石。结合5G技术,SV-LLMs能够确保手术视频的实时传输和模型推理的即时反馈,从而使得远程手术指导真正具备临床可行性 57。

6.3.2 Building an Intelligent Surgical Training System

SV-LLMs将成为构建现代化、个性化、高效外科培训体系的核心技术,推动外科教育从经验依赖向数据驱动转变。

  • 交互式虚拟手术模拟器: 基于SV-LLMs对手术视频的深度理解,可以开发出高度逼真且具备智能交互能力的虚拟手术模拟器。这些模拟器不仅能模拟各种手术场景,还能根据学员的操作行为提供实时反馈和指导,甚至模拟并发症的发生并引导学员处理。这比传统模拟器更具智能性和适应性,可以大大加速外科医生的学习曲线。例如,虚拟现实模拟器已被用于外科培训,而结合SV-LLMs可以使其反馈更加个性化和精确。
  • 个性化学习路径规划与评估: SV-LLMs能够持续记录并分析学员的培训表现,包括模拟手术、观摩学习、甚至真实手术视频。基于这些数据,模型可以自动评估学员的技能水平,识别薄弱环节,并智能推荐定制化的学习内容和训练计划,实现“千人千面”的精准教学。这种个性化方案有助于提高培训效率和质量。
  • 自动化认证与考核: 视频审查已被认为是评估外科医生表现的有效工具,未来的外科医生认证和考核,可以部分或全部通过SV-LLMs进行。模型可以客观评估考生的手术操作、决策过程和风险处理能力,提供标准化、公正的评价,从而提高考核的效率和公平性 84。
  • 持续职业发展(CPD)平台: 建立基于SV-LLMs的在线CPD平台,外科医生可以通过上传自己的手术视频获取智能反馈和专业建议,或者通过平台学习最新的手术技术和规范。这有助于外科医生保持持续学习和技能提升。

6.3.3 Support for Surgical Instrument R&D

SV-LLMs对复杂手术环境和器械操作的深刻理解,将为外科器械的设计、优化和验证提供新的思路和工具。

  • 器械使用效果评估与优化: 通过分析大量手术视频,SV-LLMs可以量化评估不同外科器械在实际操作中的表现,包括其操作效率、安全性、对组织的影响以及在不同解剖结构下的适用性。例如,可以评估某种新型止血钳在不同出血情况下的止血效果,或分析微创手术器械在狭窄空间内的操作灵活性。这些数据将为器械制造商提供宝贵的设计反馈,指导器械的迭代优化。
  • 新器械设计需求洞察: SV-LLMs能够识别手术过程中现有器械难以解决的操作难题或效率瓶颈,从而为新器械的设计提供需求洞察。例如,模型可能发现某种特定缝合在特定部位难度高、耗时长,这可能意味着需要设计一种新型的自动化缝合器。顺应式机构(compliant mechanisms)的设计理念,即利用弹性变形来传递力和运动,为开发无磨损、无夹点的新型手术器械提供了方向,这些器械特别适用于腔镜手术和远程机器人手术 85。SV-LLMs可以辅助分析这些器械的性能。
  • 智能化器械集成与联动: 未来的外科器械将更加智能化。SV-LLMs可以作为这些智能器械的“大脑”,协同多种器械实现更复杂、更精细的操作。例如,通过分析超声刀在手术中的表现及其与组织接触的方式,SV-LLMs可以辅助优化其设计和使用策略。
  • 器械不良事件分析与溯源: 通过对器械相关不良事件视频的分析,SV-LLMs可以辅助制造商和监管机构进行事件溯源、原因分析,从而改进器械设计和生产工艺,提升器械的安全性。

综上所述,手术视频大语言模型不仅是临床医生的强大助手,更是推动整个外科医疗产业升级的关键驱动力。其在远程指导、培训和器械研发等方面的广泛应用,预示着一个更加智能、高效、公平和安全的未来外科医疗生态系统的到来。

Content generated by AI; for reference only. Please verify carefully.

References

[1] A temporal convolutional network-based approach and a benchmark dataset for colonoscopy video temporal segmentation.

Carlo Biffi, Giorgio Roffo, Pietro Salvagnini, et al.
Comput Methods Programs Biomed. 2025 Oct;270:108782. doi: 10.1016/j.cmpb.2025.108782. Epub 2025 Jul 3.
BACKGROUND AND OBJECTIVE: Following recent advancements in computer-aided detection and diagnosis systems for colonoscopy, the automated reporting of colonoscopy procedures is set to further revolutionize clinical practice. A crucial yet underexplored aspect in the development of these systems is the creation of computer vision models capable of autonomously segmenting full-procedure colonoscopy videos into anatomical sections and procedural phases. In this work, we aim to create the first open-access dataset for this task and propose a state-of-the-art approach, benchmarked against competitive models. METHODS: We annotated the publicly available REAL-Colon dataset, consisting of 2.7 million frames from 60 complete colonoscopy videos, with frame-level labels for anatomical locations and colonoscopy phases across nine categories. We then present ColonTCN, a learning-based architecture that employs custom temporal convolutional blocks designed to efficiently capture long temporal dependencies for the temporal segmentation of colonoscopy videos. We also propose a dual k-fold cross-validation evaluation protocol for this benchmark, which includes model assessment on unseen, multi-center data. RESULTS: ColonTCN achieves state-of-the-art performance in classification accuracy while maintaining a low parameter count when evaluated using the two proposed k-fold cross-validation settings, outperforming competitive models. We report ablation studies to provide insights into the challenges of this task and highlight the benefits of the custom temporal convolutional blocks, which enhance learning and improve model efficiency. CONCLUSIONS: We believe that the proposed open-access benchmark and the ColonTCN approach represent a significant advancement in the temporal segmentation of colonoscopy procedures, fostering further open-access research to address this clinical need. Code and data are available at: https://github.com/cosmoimd/temporal_segmentation.

[2] Arthroscopic Simulation in Orthopaedic Surgery Training.

Edward J Testa, Jacob M Modest, Rory A Byrne, et al.
R I Med J (2013). 2023 Oct 2;106(9):46-51.
Surgical simulation has become a commonly utilized and well-researched training adjunct in nearly all surgical specialties. Balancing high-quality orthopaedic surgical training in the face of work hour restrictions and efficiency pressures has become a challenge to educators and trainees alike. Surgical simulation is an opportunity to enhance such training and potentially permit trainees to be better equipped for the operating room. In orthopaedics, various low-fidelity, high-fidelity, and virtual reality simulation platforms are readily available to almost all trainees and permit simulation of a wide array of arthroscopic surgeries. In this review, we seek to highlight the potential utility of simulation-based training in orthopaedic surgery, the various types of available simulators, and review the evidence for simulator use.

[3] [Simulation in surgical training].

A Nabavi, J Schipper
HNO. 2017 Jan;65(1):7-12. doi: 10.1007/s00106-016-0248-1.
BACKGROUND: Patient safety during operations hinges on the surgeon's skills and abilities. However, surgical training has come under a variety of restrictions. To acquire dexterity with decreasingly "simple" cases, within the legislative time constraints and increasing expectations for surgical results is the future challenge. OBJECTIVES: Are there alternatives to traditional master-apprentice learning? MATERIALS AND METHODS: A literature review and analysis of the development, implementation, and evaluation of surgical simulation are presented. RESULTS: Simulation, using a variety of methods, most important physical and virtual (computer-generated) models, provides a safe environment to practice basic and advanced skills without endangering patients. These environments have specific strengths and weaknesses. CONCLUSIONS: Simulations can only serve to decrease the slope of learning curves, but cannot be a substitute for the real situation. Thus, they have to be an integral part of a comprehensive training curriculum. Our surgical societies have to take up that challenge to ensure the training of future generations.

[4] Virtual temporal bone simulators and their use in surgical training: a narrative review.

Lauren Bolton, Kenneth Young, Jaydip Ray, et al.
J Laryngol Otol. 2024 Apr;138(4):356-360. doi: 10.1017/S0022215123002025. Epub 2023 Nov 17.
OBJECTIVE: Temporal bone dissection is a difficult skill to acquire, and the challenge has recently been further compounded by a reduction in conventional surgical training opportunities during the coronavirus disease 2019 pandemic. Consequently, there has been renewed interest in ear simulation as an adjunct to surgical training for trainees. We review the state-of-the-art virtual temporal bone simulators for surgical training. MATERIALS AND METHODS: A narrative review of the current literature was performed following a Medline search using a pre-determined search strategy. RESULTS AND ANALYSIS: Sixty-one studies were included. There are five validated temporal bone simulators: Voxel-Man, CardinalSim, Ohio State University Simulator, Melbourne University's Virtual Reality Surgical Simulation and Visible Ear Simulator. The merits of each have been reviewed, alongside their role in surgical training. CONCLUSION: Temporal bone simulators have been demonstrated to be useful adjuncts to conventional surgical training methods and are likely to play an increasing role in the future.

5. Critical design and validation considerations for the development of neonatal minimally invasive surgery simulators. [PubMed]

David Nair, Jonathan M Wells, Nick Cook, et al.
J Pediatr Surg. 2019 Nov;54(11):2448-2452. doi: 10.1016/j.jpedsurg.2019.05.022. Epub 2019 Jun 7.
BACKGROUND/PURPOSE: Pediatric surgical trainees have limited exposure to advanced minimally invasive surgery (MIS) during their clinical training, particularly for cases such as esophageal atresia/tracheoesophageal fistula (EA/TEF). Simulation on validated neonatal models offers an alternative means of training that may augment traditional forms of training; but to be useful, they must fulfill certain criteria. METHODOLOGY: Review of the currently available MIS, thoracoscopic and laparoscopic, simulators for pediatric surgery, and identification of those factors that contribute to their fidelity and validity as a training tool that must be incorporated in the design of future simulation models. RESULTS: There are few neonatal laparoscopic and thoracoscopic models currently available, or in the research stage. To our knowledge, there is no commercially available, synthetic, high fidelity and low cost thoracoscopic model in existence. Use of animal tissue has disadvantages of ethical dilemmas, cost, and logistic and procurement issues. Newer synthetic models need to be validated for fidelity, to replicate those components of the operation that pose the greatest technical challenge, and incorporate means of measuring acquisition of technical expertise. CONCLUSION: This review describes the principles that need to be considered to develop low cost, validated high-fidelity MIS simulator that can be used for training, and that is capable of measuring the acquisition of the technical skills that can be applied to the repair of complex procedures such as esophageal atresia. Level of evidence III.

6. Artificial Intelligence and Surgical Decision-making. [PubMed]

Tyler J Loftus, Patrick J Tighe, Amanda C Filiberto, et al.
JAMA Surg. 2020 Feb 1;155(2):148-158. doi: 10.1001/jamasurg.2019.4917.
IMPORTANCE: Surgeons make complex, high-stakes decisions under time constraints and uncertainty, with significant effect on patient outcomes. This review describes the weaknesses of traditional clinical decision-support systems and proposes that artificial intelligence should be used to augment surgical decision-making. OBSERVATIONS: Surgical decision-making is dominated by hypothetical-deductive reasoning, individual judgment, and heuristics. These factors can lead to bias, error, and preventable harm. Traditional predictive analytics and clinical decision-support systems are intended to augment surgical decision-making, but their clinical utility is compromised by time-consuming manual data management and suboptimal accuracy. These challenges can be overcome by automated artificial intelligence models fed by livestreaming electronic health record data with mobile device outputs. This approach would require data standardization, advances in model interpretability, careful implementation and monitoring, attention to ethical challenges involving algorithm bias and accountability for errors, and preservation of bedside assessment and human intuition in the decision-making process. CONCLUSIONS AND RELEVANCE: Integration of artificial intelligence with surgical decision-making has the potential to transform care by augmenting the decision to operate, informed consent process, identification and mitigation of modifiable risk factors, decisions regarding postoperative management, and shared decisions regarding resource use.

7. Application and prospect of artificial intelligence in digestive endoscopy. [PubMed]

Huangming Zhuang, Anyu Bao, Yulin Tan, et al.
Expert Rev Gastroenterol Hepatol. 2022 Jan;16(1):21-31. doi: 10.1080/17474124.2022.2020646. Epub 2021 Dec 27.
INTRODUCTION: With the progress of science and technology, artificial intelligence represented by deep learning has gradually begun to be applied in the medical field. Artificial intelligence has been applied to benign gastrointestinal lesions, tumors, early cancer, inflammatory bowel disease, gallbladder, pancreas, and other diseases. This review summarizes the latest research results on artificial intelligence in digestive endoscopy and discusses the prospect of artificial intelligence in digestive system diseases. AREAS COVERED: We retrieved relevant documents on artificial intelligence in digestive tract diseases from PubMed and Medline. This review elaborates on the knowledge of computer-aided diagnosis in digestive endoscopy. EXPERT OPINION: Artificial intelligence significantly improves diagnostic accuracy, reduces physicians' workload, and provides a shred of evidence for clinical diagnosis and treatment. Shortly, artificial intelligence will have high application value in the field of medicine.

8. Management of postdischarge nausea and vomiting. [PubMed]

Mikhail Dziadzko, Frédéric Aubrun
Best Pract Res Clin Anaesthesiol. 2020 Dec;34(4):771-778. doi: 10.1016/j.bpa.2020.10.008. Epub 2020 Oct 31.
Postdischarge nausea and vomiting (PDNV) occurs in at least 30% of patients leaving hospital, especially after day-case surgery. A significant number of ambulatory patients may develop PDNV associated with the use of analgesics for postsurgical pain. A validated PDNV prediction score and international evidence-based consensus guidelines for PONV/PDNV management are available. High-risk patients benefit from a predischarge PDNV risk assessment and the use of adapted pharmacological intervention (combination of long- and short-acting antiemetics and access to antiemetics at home). Patient education is often overlooked in this context. All clinicians involved in the ambulatory surgery care process should participate in the development of institutional protocol for PONV/PDNV management. Constant quality control and patients' feedback should be integrated as part of an efficient implementation strategy.

9. A Multi-Task Convolutional Neural Network for Semantic Segmentation and Event Detection in Laparoscopic Surgery. [PubMed]

Giorgia Marullo, Leonardo Tanzi, Luca Ulrich, et al.
J Pers Med. 2023 Feb 25;13(3):413. doi: 10.3390/jpm13030413.
The current study presents a multi-task end-to-end deep learning model for real-time blood accumulation detection and tools semantic segmentation from a laparoscopic surgery video. Intraoperative bleeding is one of the most problematic aspects of laparoscopic surgery. It is challenging to control and limits the visibility of the surgical site. Consequently, prompt treatment is required to avoid undesirable outcomes. This system exploits a shared backbone based on the encoder of the U-Net architecture and two separate branches to classify the blood accumulation event and output the segmentation map, respectively. Our main contribution is an efficient multi-task approach that achieved satisfactory results during the test on surgical videos, although trained with only RGB images and no other additional information. The proposed multi-tasking convolutional neural network did not employ any pre- or postprocessing step. It achieved a Dice Score equal to 81.89% for the semantic segmentation task and an accuracy of 90.63% for the event detection task. The results demonstrated that the concurrent tasks were properly combined since the common backbone extracted features proved beneficial for tool segmentation and event detection. Indeed, active bleeding usually happens when one of the instruments closes or interacts with anatomical tissues, and it decreases when the aspirator begins to remove the accumulated blood. Even if different aspects of the presented methodology could be improved, this work represents a preliminary attempt toward an end-to-end multi-task deep learning model for real-time video understanding.

10. Robust deep learning-based semantic organ segmentation in hyperspectral images. [PubMed]

Silvia Seidlitz, Jan Sellner, Jan Odenthal, et al.
Med Image Anal. 2022 Aug;80:102488. doi: 10.1016/j.media.2022.102488. Epub 2022 May 27.
Semantic image segmentation is an important prerequisite for context-awareness and autonomous robotics in surgery. The state of the art has focused on conventional RGB video data acquired during minimally invasive surgery, but full-scene semantic segmentation based on spectral imaging data and obtained during open surgery has received almost no attention to date. To address this gap in the literature, we are investigating the following research questions based on hyperspectral imaging (HSI) data of pigs acquired in an open surgery setting: (1) What is an adequate representation of HSI data for neural network-based fully automated organ segmentation, especially with respect to the spatial granularity of the data (pixels vs. superpixels vs. patches vs. full images)? (2) Is there a benefit of using HSI data compared to other modalities, namely RGB data and processed HSI data (e.g. tissue parameters like oxygenation), when performing semantic organ segmentation? According to a comprehensive validation study based on 506 HSI images from 20 pigs, annotated with a total of 19 classes, deep learning-based segmentation performance increases - consistently across modalities - with the spatial context of the input data. Unprocessed HSI data offers an advantage over RGB data or processed data from the camera provider, with the advantage increasing with decreasing size of the input to the neural network. Maximum performance (HSI applied to whole images) yielded a mean DSC of 0.90 (standard deviation (SD) 0.04), which is in the range of the inter-rater variability (DSC of 0.89 (SD 0.07)). We conclude that HSI could become a powerful image modality for fully-automatic surgical scene understanding with many advantages over traditional imaging, including the ability to recover additional functional tissue information. Our code and pre-trained models are available at https://github.com/IMSY-DKFZ/htc.

11. Temporal-based Swin Transformer network for workflow recognition of surgical video. [PubMed]

Xiaoying Pan, Xuanrong Gao, Hongyu Wang, et al.
Int J Comput Assist Radiol Surg. 2023 Jan;18(1):139-147. doi: 10.1007/s11548-022-02785-y. Epub 2022 Nov 4.
PURPOSE: Surgical workflow recognition has emerged as an important part of computer-assisted intervention systems for the modern operating room, which also is a very challenging problem. Although the CNN-based approach achieves excellent performance, it does not learn global and long-range semantic information interactions well due to the inductive bias inherent in convolution. METHODS: In this paper, we propose a temporal-based Swin Transformer network (TSTNet) for the surgical video workflow recognition task. TSTNet contains two main parts: the Swin Transformer and the LSTM. The Swin Transformer incorporates the attention mechanism to encode remote dependencies and learn highly expressive representations. The LSTM is capable of learning long-range dependencies and is used to extract temporal information. The TSTNet organically combines the two components to extract spatiotemporal features that contain more contextual information. In particular, based on a full understanding of the natural features of the surgical video, we propose a priori revision algorithm (PRA) using a priori information about the sequence of the surgical phase. This strategy optimizes the output of TSTNet and further improves the recognition performance. RESULTS: We conduct extensive experiments using the Cholec80 dataset to validate the effectiveness of the TSTNet-PRA method. Our method achieves excellent performance on the Cholec80 dataset, which accuracy is up to 92.8% and greatly exceeds the state-of-the-art methods. CONCLUSION: By modelling remote temporal information and multi-scale visual information, we propose the TSTNet-PRA method. It was evaluated on a large public dataset, showing a high recognition capability superior to other spatiotemporal networks.

12. Large Language Models Are Natural Video Popularity Predictors. [OpenAlex]

Pratik Kayal, Pascal Mettes, Nima Dehmamy, et al.
Predicting video popularity is often framed as a supervised learning task, relying heavily on meta-information and aggregated engagement data. However, video popularity is shaped by complex cultural and social factors that such approaches often overlook. We argue that Large Language Models (LLMs), with their deep contextual awareness, can better capture these nuances. To bridge the gap between pixel-based video data and token-based LLMs, we convert frame-level visuals into sequential text representations using Vision-Language Models. This enables LLMs to process multimodal content (titles, frame-based descriptions, and captions), capturing both engagement intensity (view count) and geographic spread (number of countries where a video trends). On 13,639 popular videos, a supervised neural network using content embeddings achieves 80% accuracy, while our LLM-based approach reaches 82% without fine-tuning. Combining the neural network's predictions with the LLM further improves accuracy to 85.5%. Moreover, the LLM generates interpretable, attribute-based explanations for its predictions. Manual validations confirm the quality of these hypotheses and address concerns about hallucinations in the video-to-text conversion process. Overall, our findings suggest that LLMs, equipped with text-based multimodal representations, offer a powerful, interpretable, and data-efficient solution for tasks requiring rich contextual insight, such as video popularity prediction.

13. Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook. [PubMed]

Rawan AlSaad, Alaa Abd-Alrazaq, Sabri Boughorbel, et al.
J Med Internet Res. 2024 Sep 25;26:e59505. doi: 10.2196/59505.
In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to processing unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links diverse aspects of M-LLMs, offering a unified vision for their future in health care. This approach aims to guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data-driven medical practice. We anticipate that this work will spark further discussion and inspire the development of innovative approaches in the next generation of medical M-LLM systems.

14. SEED-Bench: Benchmarking Multimodal Large Language Models. [OpenAlex]

Bohao Li, Yuying Ge, Yixiao Ge, et al.
Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3). However, existing MLLM benchmarks remain limited to assessing only models' comprehension ability of single image-text inputs, failing to keep up with the strides made in MLLMs. A comprehensive benchmark is imperative for investigating the progress and uncovering the limitations of current MLLMs. In this work, we categorize the capabilities of MLLMs into hierarchical levels from L0 to L4 based on the modalities they can accept and generate, and propose SEED-Bench, a comprehensive benchmark that evaluates the hierarchical capabilities of MLLMs. Specifically, SEED-Bench comprises 24K multiple-choice questions with accurate human annotations, which span 27 dimensions, including the evaluation of both text and image generation. Multiple-choice questions with ground truth options derived from human annotation enable an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 22 prominent open-source MLLMs and summarize valuable observations. By revealing the limitations of existing MLLMs through extensive evaluations, we aim for SEED-Bench to provide insights that will motivate future research toward the goal of General Artificial Intelligence. Dataset and evaluation code are available at https://github.com/AILab-CVC/SEED-Bench.

15. A multimodal generative AI copilot for human pathology. [PubMed]

Ming Y Lu, Bowen Chen, Drew F K Williamson, et al.
Nature. 2024 Oct;634(8033):466-473. doi: 10.1038/s41586-024-07618-3. Epub 2024 Jun 12.
Computational pathology has witnessed considerable progress in the development of both task-specific predictive models and task-agnostic self-supervised vision encoders. However, despite the explosive growth of generative artificial intelligence (AI), there have been few studies on building general-purpose multimodal AI assistants and copilots tailored to pathology. Here we present PathChat, a vision-language generalist AI assistant for human pathology. We built PathChat by adapting a foundational vision encoder for pathology, combining it with a pretrained large language model and fine-tuning the whole system on over 456,000 diverse visual-language instructions consisting of 999,202 question and answer turns. We compare PathChat with several multimodal vision-language AI assistants and GPT-4V, which powers the commercially available multimodal general-purpose AI assistant ChatGPT-4. PathChat achieved state-of-the-art performance on multiple-choice diagnostic questions from cases with diverse tissue origins and disease models. Furthermore, using open-ended questions and human expert evaluation, we found that overall PathChat produced more accurate and pathologist-preferable responses to diverse queries related to pathology. As an interactive vision-language AI copilot that can flexibly handle both visual and natural language inputs, PathChat may potentially find impactful applications in pathology education, research and human-in-the-loop clinical decision-making.

16. NExT-GPT: Any-to-Any Multimodal LLM. [OpenAlex]

Shengqiong Wu, Hao Fei, Leigang Qu, et al.
While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging the existing well-trained highly-performing encoders and decoders, NExT-GPT is tuned with only a small amount of parameter (1%) of certain projection layers, which not only benefits low-cost training and also facilitates convenient expansion to more potential modalities. Moreover, we introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community. Project page: https://next-gpt.github.io/

17. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. [OpenAlex]

Gemini Team, Petko Georgiev, Ving Ian Lei, et al.
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

18. Modeling and segmentation of surgical workflow from laparoscopic video. [PubMed]

Tobias Blum, Hubertus Feussner, Nassir Navab
Med Image Comput Comput Assist Interv. 2010;13(Pt 3):400-7. doi: 10.1007/978-3-642-15711-0_50.
Modeling and analyzing surgeries based on signals that are obtained automatically from the operating room (OR) is a field of recent interest. It can be valuable for analyzing and understanding surgical workflow, for skills evaluation and developing context-aware ORs. In minimally invasive surgery, laparoscopic video is easy to record but it is challenging to extract meaningful information from it. We propose a method that uses additional information about tool usage to perform a dimensionality reduction on image features. Using Canonical Correlation Analysis (CCA) a projection of a high-dimensional image feature space to a low dimensional space is obtained such that semantic information is extracted from the video. To model a surgery based on the signals in the reduced feature space two different statistical models are compared. The capability of segmenting a new surgery into phases only based on the video is evaluated. Dynamic Time Warping which strongly depends on the temporal order in combination with CCA shows the best results.

19. Hard frame detection for the automated clipping of surgical nasal endoscopic video. [PubMed]

Hongyu Wang, Xiaoying Pan, Hao Zhao, et al.
Int J Comput Assist Radiol Surg. 2021 Feb;16(2):231-240. doi: 10.1007/s11548-021-02311-6. Epub 2021 Jan 18.
PURPOSE: The automated clipping of surgical nasal endoscopic video is a challenging task because there are many hard frames that have indiscriminative visual features which lead to misclassification. Prior works mainly aim to classify these hard frames along with other frames, and it would seriously affect the performance of classification. METHODS: We propose a hard frame detection method using a convolutional LSTM network (called HFD-ConvLSTM) to remove invalid video frames automatically. Firstly, a new separator based on the coarse-grained classifier is defined to remove the invalid frames. Meanwhile, the hard frames are detected via measuring the blurring score of a video frame. Then, the squeeze-and-excitation is used to select the informative spatial-temporal features of endoscopic videos and further classify the video frames with a fine-grained ConvLSTM learning from the reconstructed training set with hard frames. RESULTS: We justify the proposed solution through extensive experiments using 12 surgical videos (duration: 8501 s). The experiments are performed on both hard frame detection and video frame classification. Nearly 88.3% of fuzzy frames can be detected and the classification accuracy is boosted to 95.2%. HFD-ConvLSTM achieves superior performance compared to other methods. CONCLUSION: HFD-ConvLSTM provides a new paradigm for video clipping by breaking the complex clipping problem into smaller, more easily managed 2-classification problems. Our investigation reveals that the hard frame detection based on blurring score calculation is effective for nasal endoscopic video clipping.

20. Quantitative methodology of evaluating surgeon performance in laparoscopic surgery. [PubMed]

Paul B McBeth, Antony J Hodgson, Alex G Nagy, et al.
Stud Health Technol Inform. 2002;85:280-6.
Quantitative performance and skill assessments are critical for evaluating the progress of surgical residents and the efficacy of different training programs. Current evaluation methods are subjective and potentially unreliable, so there is a need for objective methods to evaluate surgical performance. We identify a feasible method to measure kinematic data in the live operating room setting and to assess the repeatability of an analysis method based on a hierarchical decomposition of surgical tasks. We used an optoelectronic motion analysis system to acquire postural data and tool tip trajectories of one expert surgeon over a period of four months. To assess repeatability of performance measures, we created a hierarchical decomposition diagram describing the procedure in terms of surgical tasks, tool sequences and fundamental tool actions. From the kinematic data, we extracted characteristic measures of individual tool actions and compared these measured distributions using the Kolmogorov-Smirnov statistic. The comparisons of distributions show consistent performance over time by a trained surgeon and little effect from patient variability, and so are likely reliable measures of performance. An expanded set of reliable kinematic measures will form the basis for quantifying surgical skill and should be useful in validating surgical simulations for use in training, certifying surgeons and designing and evaluating new surgical tools.

21. Gesture Recognition in Robotic Surgery: A Review. [PubMed]

Beatrice van Amsterdam, Matthew J Clarkson, Danail Stoyanov
IEEE Trans Biomed Eng. 2021 Jun;68(6):2021-2035. doi: 10.1109/TBME.2021.3054828. Epub 2021 May 21.
OBJECTIVE: Surgical activity recognition is a fundamental step in computer-assisted interventions. This paper reviews the state-of-the-art in methods for automatic recognition of fine-grained gestures in robotic surgery focusing on recent data-driven approaches and outlines the open questions and future research directions. METHODS: An article search was performed on 5 bibliographic databases with the following search terms: robotic, robot-assisted, JIGSAWS, surgery, surgical, gesture, fine-grained, surgeme, action, trajectory, segmentation, recognition, parsing. Selected articles were classified based on the level of supervision required for training and divided into different groups representing major frameworks for time series analysis and data modelling. RESULTS: A total of 52 articles were reviewed. The research field is showing rapid expansion, with the majority of articles published in the last 4 years. Deep-learning-based temporal models with discriminative feature extraction and multi-modal data integration have demonstrated promising results on small surgical datasets. Currently, unsupervised methods perform significantly less well than the supervised approaches. CONCLUSION: The development of large and diverse open-source datasets of annotated demonstrations is essential for development and validation of robust solutions for surgical gesture recognition. While new strategies for discriminative feature extraction and knowledge transfer, or unsupervised and semi-supervised approaches, can mitigate the need for data and labels, they have not yet been demonstrated to achieve comparable performance. Important future research directions include detection and forecast of gesture-specific errors and anomalies. SIGNIFICANCE: This paper is a comprehensive and structured analysis of surgical gesture recognition methods aiming to summarize the status of this rapidly evolving field.

22. An automated skills assessment framework for laparoscopic training tasks. [PubMed]

Nicholas P Sgouros, Constantinos Loukas, Vassiliki Koufi, et al.
Int J Med Robot. 2018 Feb;14(1). doi: 10.1002/rcs.1853. Epub 2017 Aug 15.
BACKGROUND: Various sensors and methods are used for evaluating trainees' skills in laparoscopic procedures. These methods are usually task-specific and involve high costs or advanced setups. METHODS: In this paper, we propose a novel manoeuver representation feature space (MRFS) constructed by tracking the vanishing points of the edges of the graspers on the video sequence frames, acquired by the standard box trainer camera. This study aims to provide task-agnostic classification of trainees in experts and novices using a single MRFS over two basic laparoscopic tasks. RESULTS: The system achieves an average of 96% correct classification ratio (CCR) when no information on the performed task is available and >98% CCR when the task is known, outperforming a recently proposed video-based technique by >13%. CONCLUSIONS: Robustness, extensibility and accurate task-agnostic classification between novices and experts is achieved by utilizing advanced computer vision techniques and derived features from a novel MRFS.

23. Procedure-Aware Hierarchical Alignment for Open Surgery Video-Language Pretraining. [PubMed]

Boqiang Xu, Jinlin Wu, Jian Liang, et al.
IEEE Trans Image Process. 2026;35:1966-1976. doi: 10.1109/TIP.2026.3659752.
Recent advances in surgical robotics and computer vision have greatly improved intelligent systems' autonomy and perception in the operating room (OR), especially in endoscopic and minimally invasive surgeries. However, for open surgery, which is still the predominant form of surgical intervention worldwide, there has been relatively limited exploration due to its inherent complexity and the lack of large-scale, diverse datasets. To close this gap, we present OpenSurgery, by far the largest video-text pretraining and evaluation dataset for open surgery understanding. OpenSurgery consists of two subsets: OpenSurgery-Pretrain and OpenSurgery-EVAL. OpenSurgery-Pretrain consists of 843 publicly available open surgery videos for pretraining, spanning 102 hours and encompassing over 20 distinct surgical types. OpenSurgery-EVAL is a benchmark dataset for evaluating model performance in open surgery understanding, comprising 280 training and 120 test videos, totaling 49 hours. Each video in OpenSurgery is meticulously annotated by expert surgeons at three hierarchical levels of video, operation, and frame to ensure both high quality and strong clinical applicability. Next, we propose the Hierarchical Surgical Knowledge Pretraining (HierSKP) framework to facilitate large-scale multimodal representation learning for open surgery understanding. HierSKP leverages a granularity-aware contrastive learning strategy and enhances procedural comprehension by constructing hard negative samples and incorporating a Dynamic Time Warping (DTW)-based loss to capture fine-grained temporal alignment of visual semantics. Extensive experiments show that HierSKP achieves state-of-the-art performance on OpenSurgery-EVAL across multiple tasks, including operation recognition, temporal action localization, and zero-shot cross-modal retrieval. This demonstrates its strong generalizability for further advances in open surgery understanding.

24. Pain process of patients with cardiac surgery - Semantic annotation of electronic patient record data. PubMed

Kristiina Heikkilä, Anna Axelin, Laura-Maria Peltonen, et al.
J Clin Nurs. 2019 May;28(9-10):1555-1567. doi: 10.1111/jocn.14752. Epub 2019 Jan 15.
AIMS AND OBJECTIVES: To describe and compare the pain process of patients with cardiac surgery through nurses' and physicians' documentation in the electronic patient records. BACKGROUND: Postoperative pain assessment and management should be documented regularly, to ensure optimal pain care process for patients. Despite availability of evidence-based guidelines, pain assessment and documentation remain inadequate. DESIGN: A retrospective patients' record review. METHODS: The original data consisted of the electronic patient records of 26,922 patients with a diagnosed heart disease. A total of 1,818 care episodes of patients with cardiac surgery were selected from the data. We used random sampling to obtain 280 care episodes for annotation. These 280 care episodes contained 2,156 physician reports and 1,327 days of nursing notes. We developed an annotation manual and schema, and then, we manually conducted semantic annotation on care episodes, using the Brat annotation tool. We analysed the annotation units using thematic analysis. Consolidated criteria for reporting qualitative research guideline was followed in reporting where appropriate in this study design. RESULTS: We discovered expressions of six different aspects of pain process: (a) cause, (b) situation, (c) features, (d) consequences, (e) actions and (f) outcomes. We determined that five of the aspects existed chronologically. However, the features of pain were simultaneously existing. They indicated the location, quality, intensity, and temporality of the pain and they were present in every phase of the patient's pain process. Cardiac and postoperative pain documentations differed from each other in used expressions and in the quantity and quality of descriptions. CONCLUSION: We could construct a comprehensive pain process of the patients with cardiac surgery from several electronic patient records. The challenge remains how to support systematic documentation in each patient. RELEVANCE TO CLINICAL PRACTICE: The study provides knowledge and guidance of pain process aspects that can be used to achieve an effective pain assessment and more comprehensive documentation.

25. From Cues to Nudge: A Knowledge-Based Framework for Surveillance of Healthcare-Associated Infections. PubMed

Arash Shaban-Nejad, Hiroshi Mamiya, Alexandre Riazanov, et al.
J Med Syst. 2016 Jan;40(1):23. doi: 10.1007/s10916-015-0364-6. Epub 2015 Nov 4.
We propose an integrated semantic web framework consisting of formal ontologies, web services, a reasoner and a rule engine that together recommend an appropriate level of patient care based on the defined semantic rules and guidelines. The classification of healthcare-associated infections within the HAIKU (Hospital Acquired Infections - Knowledge in Use) framework enables hospitals to consistently follow the standards along with their routine clinical practice and diagnosis coding to improve quality of care and patient safety. The HAI ontology (HAIO) groups thousands of codes into a consistent hierarchy of concepts, along with relationships and axioms to capture knowledge on hospital-associated infections and complications with focus on the big four types: surgical site infections (SSIs), catheter-associated urinary tract infection (CAUTI), hospital-acquired pneumonia, and blood stream infection. By employing statistical inferencing in our study we use a set of heuristics to define the rule axioms to improve the SSI case detection. We also demonstrate how the occurrence of an SSI is identified using semantic e-triggers. The e-triggers will be used to improve our risk assessment of post-operative surgical site infections (SSIs) for patients undergoing certain types of surgeries (e.g., coronary artery bypass graft surgery (CABG)).

26. Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports. PubMed

Haitham A Elmarakeby, Pavel S Trukhanov, Vidal M Arroyo, et al.
BMC Bioinformatics. 2023 Sep 2;24(1):328. doi: 10.1186/s12859-023-05439-1.
BACKGROUND: Longitudinal data on key cancer outcomes for clinical research, such as response to treatment and disease progression, are not captured in standard cancer registry reporting. Manual extraction of such outcomes from unstructured electronic health records is a slow, resource-intensive process. Natural language processing (NLP) methods can accelerate outcome annotation, but they require substantial labeled data. Transfer learning based on language modeling, particularly using the Transformer architecture, has achieved improvements in NLP performance. However, there has been no systematic evaluation of NLP model training strategies on the extraction of cancer outcomes from unstructured text. RESULTS: We evaluated the performance of nine NLP models at the two tasks of identifying cancer response and cancer progression within imaging reports at a single academic center among patients with non-small cell lung cancer. We trained the classification models under different conditions, including training sample size, classification architecture, and language model pre-training. The training involved a labeled dataset of 14,218 imaging reports for 1112 patients with lung cancer. A subset of models was based on a pre-trained language model, DFCI-ImagingBERT, created by further pre-training a BERT-based model using an unlabeled dataset of 662,579 reports from 27,483 patients with cancer from our center. A classifier based on our DFCI-ImagingBERT, trained on more than 200 patients, achieved the best results in most experiments; however, these results were marginally better than simpler "bag of words" or convolutional neural network models. CONCLUSION: When developing AI models to extract outcomes from imaging reports for clinical cancer research, if computational resources are plentiful but labeled training data are limited, large language models can be used for zero- or few-shot learning to achieve reasonable performance. When computational resources are more limited but labeled training data are readily available, even simple machine learning architectures can achieve good performance for such tasks.

27. Almanac - Retrieval-Augmented Language Models for Clinical Medicine. PubMed

Cyril Zakka, Rohan Shad, Akash Chaurasia, et al.
NEJM AI. 2024 Feb;1(2). doi: 10.1056/aioa2300068. Epub 2024 Jan 25.
BACKGROUND: Large language models (LLMs) have recently shown impressive zero-shot capabilities, whereby they can use auxiliary data, without the availability of task-specific training examples, to complete a variety of natural language tasks, such as summarization, dialogue generation, and question answering. However, despite many promising applications of LLMs in clinical medicine, adoption of these models has been limited by their tendency to generate incorrect and sometimes even harmful statements. METHODS: We tasked a panel of eight board-certified clinicians and two health care practitioners with evaluating Almanac, an LLM framework augmented with retrieval capabilities from curated medical resources for medical guideline and treatment recommendations. The panel compared responses from Almanac and standard LLMs (ChatGPT-4, Bing, and Bard) versus a novel data set of 314 clinical questions spanning nine medical specialties. RESULTS: Almanac showed a significant improvement in performance compared with the standard LLMs across axes of factuality, completeness, user preference, and adversarial safety. CONCLUSIONS: Our results show the potential for LLMs with access to domain-specific corpora to be effective in clinical decision-making. The findings also underscore the importance of carefully testing LLMs before deployment to mitigate their shortcomings. (Funded by the National Institutes of Health, National Heart, Lung, and Blood Institute.).

28. Medical large language models are vulnerable to data-poisoning attacks. PubMed

Daniel Alexander Alber, Zihao Yang, Anton Alyakin, et al.
Nat Med. 2025 Feb;31(2):618-626. doi: 10.1038/s41591-024-03445-1. Epub 2025 Jan 8.
The adoption of large language models (LLMs) in healthcare demands a careful analysis of their potential to spread false medical knowledge. Because LLMs ingest massive volumes of data from the open Internet during training, they are potentially exposed to unverified medical knowledge that may include deliberately planted misinformation. Here, we perform a threat assessment that simulates a data-poisoning attack against The Pile, a popular dataset used for LLM development. We find that replacement of just 0.001% of training tokens with medical misinformation results in harmful models more likely to propagate medical errors. Furthermore, we discover that corrupted models match the performance of their corruption-free counterparts on open-source benchmarks routinely used to evaluate medical LLMs. Using biomedical knowledge graphs to screen medical LLM outputs, we propose a harm mitigation strategy that captures 91.9% of harmful content (F1 = 85.7%). Our algorithm provides a unique method to validate stochastically generated LLM outputs against hard-coded relationships in knowledge graphs. In view of current calls for improved data provenance and transparent LLM development, we hope to raise awareness of emergent risks from LLMs trained indiscriminately on web-scraped data, particularly in healthcare where misinformation can potentially compromise patient safety.

29. Video-Instrument Synergistic Network for Referring Video Instrument Segmentation in Robotic Surgery. PubMed

Hongqiu Wang, Guang Yang, Shichen Zhang, et al.
IEEE Trans Med Imaging. 2024 Dec;43(12):4457-4469. doi: 10.1109/TMI.2024.3426953. Epub 2024 Dec 2.
Surgical instrument segmentation is fundamentally important for facilitating cognitive intelligence in robot-assisted surgery. Although existing methods have achieved accurate instrument segmentation results, they simultaneously generate segmentation masks of all instruments, which lack the capability to specify a target object and allow an interactive experience. This paper focuses on a novel and essential task in robotic surgery, i.e., Referring Surgical Video Instrument Segmentation (RSVIS), which aims to automatically identify and segment the target surgical instruments from each video frame, referred by a given language expression. This interactive feature offers enhanced user engagement and customized experiences, greatly benefiting the development of the next generation of surgical education systems. To achieve this, this paper constructs two surgery video datasets to promote the RSVIS research. Then, we devise a novel Video-Instrument Synergistic Network (VIS-Net) to learn both video-level and instrument-level knowledge to boost performance, while previous work only utilized video-level information. Meanwhile, we design a Graph-based Relation-aware Module (GRM) to model the correlation between multi-modal information (i.e., textual description and video frame) to facilitate the extraction of instrument-level information. Extensive experimental results on two RSVIS datasets exhibit that the VIS-Net can significantly outperform existing state-of-the-art referring segmentation methods. We will release our code and dataset for future research (https://github.com/whq-xxh/RSVIS).

30. Learning the representation of instrument images in laparoscopy videos. PubMed

Sabrina Kletz, Klaus Schoeffmann, Heinrich Husslein
Healthc Technol Lett. 2019 Nov 26;6(6):197-203. doi: 10.1049/htl.2019.0077. eCollection 2019 Dec.
Automatic recognition of instruments in laparoscopy videos poses many challenges that need to be addressed, like identifying multiple instruments appearing in various representations and in different lighting conditions, which in turn may be occluded by other instruments, tissue, blood, or smoke. Considering these challenges, it may be beneficial for recognition approaches that instrument frames are first detected in a sequence of video frames for further investigating only these frames. This pre-recognition step is also relevant for many other classification tasks in laparoscopy videos, such as action recognition or adverse event analysis. In this work, the authors address the task of binary classification to recognise video frames as either instrument or non-instrument images. They examine convolutional neural network models to learn the representation of instrument frames in videos and take a closer look at learned activation patterns. For this task, GoogLeNet together with batch normalisation is trained and validated using a publicly available dataset for instrument count classifications. They compared transfer learning with learning from scratch and evaluate on datasets from cholecystectomy and gynaecology. The evaluation shows that fine-tuning a pre-trained model on the instrument and non-instrument images is much faster and more stable in learning than training a model from scratch.

31. Endoscopy Artefact Detection by Deep Transfer Learning of Baseline Models. PubMed

Tang-Kai Yin, Kai-Lun Huang, Si-Rong Chiu, et al.
J Digit Imaging. 2022 Oct;35(5):1101-1110. doi: 10.1007/s10278-022-00627-6. Epub 2022 Apr 27.
To visualise the tumours inside the body on a screen, a long and thin tube is inserted with a light source and a camera at the tip to obtain video frames inside organs in endoscopy. However, multiple artefacts exist in these video frames that cause difficulty during the diagnosis of cancers. In this research, deep learning was applied to detect eight kinds of artefacts: specularity, bubbles, saturation, contrast, blood, instrument, blur, and imaging artefacts. Based on transfer learning with pre-trained parameters and fine-tuning, two state-of-the-art methods were applied for detection: faster region-based convolutional neural networks (Faster R-CNN) and EfficientDet. Experiments were implemented on the grand challenge dataset, Endoscopy Artefact Detection and Segmentation (EAD2020). To validate our approach in this study, we used phase I of 2,200 frames and phase II of 331 frames in the original training dataset with ground-truth annotations as training and testing dataset, respectively. Among the tested methods, EfficientDet-D2 achieves a score of 0.2008 (mAP × 0.6 + mIoU × 0.4) on the dataset that is better than three other baselines: Faster-RCNN, YOLOv3, and RetinaNet, and competitive to the best non-baseline result scored 0.25123 on the leaderboard although our testing was on phase II of 331 frames instead of the original 200 testing frames. Without extra improvement techniques beyond basic neural networks such as test-time augmentation, we showed that a simple baseline could achieve state-of-the-art performance in detecting artefacts in endoscopy. In conclusion, we proposed the combination of EfficientDet-D2 with suitable data augmentation and pre-trained parameters during fine-tuning training to detect the artefacts in endoscopy.
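
The score quoted in this abstract is a weighted sum of mean average precision (mAP, weight 0.6) and mean intersection over union (mIoU, weight 0.4). A minimal sketch of that arithmetic follows; the function name and the example inputs are hypothetical, since the abstract reports only the combined score and not its two components.

```python
def weighted_detection_score(
    map_score: float,
    miou_score: float,
    w_map: float = 0.6,
    w_miou: float = 0.4,
) -> float:
    """Combine mAP and mIoU into one score using the weights in the abstract."""
    return w_map * map_score + w_miou * miou_score


if __name__ == "__main__":
    # Hypothetical component values, chosen only to show the calculation.
    print(round(weighted_detection_score(0.25, 0.125), 4))
```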

32. Closing the data gap: leveraging pretrained neural networks for robotic surgical assessment on limited clinical data. PubMed

Nasseh Hashemi, Matias Mose, Lasse R Østergaard, et al.
J Robot Surg. 2025 Nov 24;20(1):39. doi: 10.1007/s11701-025-02994-y.
BACKGROUND: In robot-assisted surgery (RAS), surgical assessment is critical for ensuring competence and achieving optimal surgical outcomes. Artificial intelligence (AI)-based assessment offers an alternative to expert-based assessment but often requires large datasets, which are challenging to obtain. Transfer learning with pretrained algorithms may offer a potential solution and could reduce the need for clinical data. This study explores the use of transfer learning with preclinical porcine data to reduce the clinical data needed for action recognition (AC) and skills assessment (SA) in RAS. METHODS: Abdominal, thoracic and urologic RAS procedures were video recorded. A convolutional neural network (CNN) with a Long Short-Term Memory (LSTM) layer, initially trained using preclinical data, was applied to the clinical dataset through three strategies: (1) direct application on the clinical dataset, (2) only training the LSTM and dense layers, and (3) retraining the entire network. For comparison, a baseline model was trained from scratch on clinical data. RESULTS: Recordings from 15 procedures were included. The baseline clinical model achieved accuracies of 82.7% (AC) and 40.8% (SA). Direct application of the pretrained network resulted in accuracies of 84.8% (AC) and 51.6% (SA). Fine-tuning the LSTM and dense layers of the pretrained network yielded accuracies of 90.1% (AC) and 60.4% (SA), while retraining all layers achieved 90.5% (AC) and 57.6% (SA). Ablation analysis demonstrated higher accuracies with less data using transfer learning, 87.9% vs. 81.6%. CONCLUSIONS: Using pretrained preclinical AI models increases the accuracy of models trained on limited clinical data and reduces the need for clinical data. PUBLIC TRIAL REGISTRY: www.clinicaltrials.gov (ID: NCT06612606). SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s11701-025-02994-y.

33. Development of an AI-Assisted System for Automatic Recognition and Localization Marking of Colonic Polyps (With Video). PubMed

Jian Chen, Ganhong Wang, Yu Ding, et al.
J Gastroenterol Hepatol. 2025 Jul;40(7):1797-1808. doi: 10.1111/jgh.16980. Epub 2025 Apr 14.
BACKGROUND: Localizing colorectal polyps identified during the initial colonoscopy in minimally invasive endoscopic surgery presents significant challenges. These challenges include imprecise location descriptions, unclear images, a high number of polyps, and polyp characteristics such as flat shapes and low color contrast. To address these issues, we developed an AI-assisted system for the automatic detection and localization of colorectal polyps. METHODS: Colonic images and videos from three medical centers, collected between January 2018 and August 2024, were categorized based on pathology results into normal, adenomatous polyp, and serrated lesion groups. Transfer learning and fine-tuning were conducted on five pretrained CNN models, with performance evaluated using metrics such as accuracy, precision, sensitivity, and AUC. The best-performing model was selected for interpretability analysis and developed into an AI-assisted system capable of both polyp recognition and location marking. RESULTS: Among the five models, EfficientNetV2 performed the best, achieving accuracy, precision, sensitivity, and F1 scores of 0.933, 0.917, 0.916, and 0.917, respectively, on the validation set. On the test set, the model's overall weighted average precision, specificity, and AUC were 0.903, 0.946, and 0.983, respectively. Two representative colonoscopy case videos predicted by the model further demonstrated the feasibility of this AI system in clinical practice. CONCLUSIONS: The AI system we developed for the automatic recognition and localization marking of colonic polyps in colonoscopy aids in the rapid localization of polyps during minimally invasive endoscopic surgery.

34. A systematic review on artificial intelligence in robot-assisted surgery. PubMed

Andrea Moglia, Konstantinos Georgiou, Evangelos Georgiou, et al.
Int J Surg. 2021 Nov;95:106151. doi: 10.1016/j.ijsu.2021.106151. Epub 2021 Oct 22.
BACKGROUND: Despite the extensive published literature on the significant potential of artificial intelligence (AI) there are no reports on its efficacy in improving patient safety in robot-assisted surgery (RAS). The purposes of this work are to systematically review the published literature on AI in RAS, and to identify and discuss current limitations and challenges. MATERIALS AND METHODS: A literature search was conducted on PubMed, Web of Science, Scopus, and IEEExplore according to PRISMA 2020 statement. Eligible articles were peer-review studies published in English language from January 1, 2016 to December 31, 2020. Amstar 2 was used for quality assessment. Risk of bias was evaluated with the Newcastle Ottawa Quality assessment tool. Data of the studies were visually presented in tables using SPIDER tool. RESULTS: Thirty-five publications, representing 3436 patients, met the search criteria and were included in the analysis. The selected reports concern: motion analysis (n = 17), urology (n = 12), gynecology (n = 1), other specialties (n = 1), training (n = 3), and tissue retraction (n = 1). Precision for surgical tools detection varied from 76.0% to 90.6%. Mean absolute error on prediction of urinary continence after robot-assisted radical prostatectomy (RARP) ranged from 85.9 to 134.7 days. Accuracy on prediction of length of stay after RARP was 88.5%. Accuracy on recognition of the next surgical task during robot-assisted partial nephrectomy (RAPN) achieved 75.7%. CONCLUSION: The reviewed studies were of low quality. The findings are limited by the small size of the datasets. Comparison between studies on the same topic was restricted due to algorithms and datasets heterogeneity. There is no proof that currently AI can identify the critical tasks of RAS operations, which determine patient outcome. There is an urgent need for studies on large datasets and external validation of the AI algorithms used. Furthermore, the results should be transparent and meaningful to surgeons, enabling them to inform patients in layman's words. REGISTRATION: Review Registry Unique Identifying Number: reviewregistry1225.

35. LMT++: Adaptively Collaborating LLMs With Multi-Specialized Teachers for Continual VQA in Robotic Surgical Videos. PubMed

Yuyang Du, Kexin Chen, Yue Zhan, et al.
IEEE Trans Med Imaging. 2025 Nov;44(11):4678-4689. doi: 10.1109/TMI.2025.3581108.
Visual question answering (VQA) plays a vital role in advancing surgical education. However, due to privacy concerns over patient data, training VQA models on previously used data is restricted, making it necessary to use an exemplar-free continual learning (CL) approach. Previous CL studies in the surgical field neglected two critical issues: i) significant domain shifts caused by the wide range of surgical procedures collected from various sources, and ii) the data imbalance problem caused by the unequal occurrence of medical instruments or surgical procedures. This paper addresses these challenges with a multimodal large language model (LLM) and an adaptive weight assignment strategy. First, we developed a novel LLM-assisted multi-teacher CL framework (named LMT++), which could harness the strength of a multimodal LLM as a supplementary teacher. The LLM's strong generalization ability, as well as its good understanding of the surgical domain, help to address the knowledge gap arising from domain shifts and data imbalances. To incorporate the LLM in our CL framework, we further proposed an innovative approach to process the training data, which involves the conversion of complex LLM embeddings into logit values used within our CL training framework. Moreover, we design an adaptive weight assignment approach that balances the generalization ability of the LLM and the domain expertise of conventional VQA models obtained in previous model training processes within the CL framework. Finally, we created a new surgical VQA dataset for model evaluation. Comprehensive experimental findings on these datasets show that our approach surpasses state-of-the-art CL methods.

36. Artificial Intelligence to Predict the Risk of Lymph Node Metastasis in T2 Colorectal Cancer. PubMed

Katsuro Ichimasa, Caterina Foppa, Shin-Ei Kudo, et al.
Ann Surg. 2024 Nov 1;280(5):850-857. doi: 10.1097/SLA.0000000000006469. Epub 2024 Jul 30.
OBJECTIVE: To develop and externally validate an updated artificial intelligence (AI) prediction system for stratifying the risk of lymph node metastasis (LNM) in T2 colorectal cancer (CRC). BACKGROUND: Recent technical advances allow complete local excision of T2 CRC, traditionally treated with surgical resection. Yet, the widespread adoption of this approach is hampered by the inability to stratify the risk of LNM. METHODS: Data from patients with pT2 CRC undergoing surgical resection between April 2000 and May 2022 at one Japanese and one Italian center were analyzed. Primary goal was AI system development for accurate LNM prediction. Predictors encompassed 7 variables: age, sex, tumor size, tumor location, lymphovascular invasion, histologic differentiation, and carcinoembryonic antigen level. The tool's discriminating power was assessed through area under the curve, sensitivity, and specificity. RESULTS: Out of 735 initial patients, 692 were eligible. Training and validation cohorts comprised of 492 and 200 patients, respectively. The AI model displayed an area under the curve of 0.75 in the combined validation data set. Sensitivity for LNM prediction was 97.8%, and specificity was 15.6%. The positive and the negative predictive value were 25.7% and 96%, respectively. The false negative rate was 2.2%, and the false positive was 84.4%. CONCLUSIONS: Our AI model, based on easily accessible clinical and pathologic variables, moderately predicts LNM in T2 CRC. However, the risk of false negative needs to be considered. The training of the model including more patients across western and eastern centers - differentiating between colon and rectal cancers - may improve its performance and accuracy.
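
The metrics reported in this abstract are all derived from one confusion matrix, and the sketch below shows how they relate. The abstract does not give the underlying counts; the counts used here (45 true positives, 1 false negative, 24 true negatives, 130 false positives) are an assumption, back-calculated for a 200-patient validation cohort so that they reproduce the reported percentages.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard diagnostic accuracy metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "ppv": tp / (tp + fp),                  # positive predictive value
        "npv": tn / (tn + fn),                  # negative predictive value
        "false_negative_rate": fn / (tp + fn),  # equals 1 - sensitivity
        "false_positive_rate": fp / (tn + fp),  # equals 1 - specificity
    }


if __name__ == "__main__":
    # Assumed counts, not taken from the paper (see the note above).
    metrics = diagnostic_metrics(tp=45, fn=1, tn=24, fp=130)
    for name, value in metrics.items():
        print(f"{name}: {value * 100:.1f}%")
```

With these assumed counts, the very high sensitivity and very low specificity mean that most patients are flagged as at risk, which is why the negative predictive value is high while the positive predictive value stays low.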

37. Artificial intelligence-based risk stratification, accurate diagnosis and treatment prediction in gynecologic oncology. PubMed

Yuting Jiang, Chengdi Wang, Shengtao Zhou
Semin Cancer Biol. 2023 Nov;96:82-99. doi: 10.1016/j.semcancer.2023.09.005. Epub 2023 Sep 30.
As data-driven science, artificial intelligence (AI) has paved a promising path toward an evolving health system teeming with thrilling opportunities for precision oncology. Notwithstanding the tremendous success of oncological AI in such fields as lung carcinoma, breast tumor and brain malignancy, less attention has been devoted to investigating the influence of AI on gynecologic oncology. Hereby, this review sheds light on the ever-increasing contribution of state-of-the-art AI techniques to the refined risk stratification and whole-course management of patients with gynecologic tumors, in particular, cervical, ovarian and endometrial cancer, centering on information and features extracted from clinical data (electronic health records), cancer imaging including radiological imaging, colposcopic images, cytological and histopathological digital images, and molecular profiling (genomics, transcriptomics, metabolomics and so forth). However, there are still noteworthy challenges beyond performance validation. Thus, this work further describes the limitations and challenges faced in the real-world implementation of AI models, as well as potential solutions to address these issues.

38. Decoding uncertainty for clinical decision-making. PubMed

Krasimira Tsaneva-Atanasova, Giulia Pederzanil, Marianna Laviola
Philos Trans A Math Phys Eng Sci. 2025 Mar 13;383(2292):20240207. doi: 10.1098/rsta.2024.0207.
In this opinion piece, we examine the pivotal role that uncertainty quantification (UQ) plays in informing clinical decision-making processes. We explore challenges associated with healthcare data and the potential barriers to the widespread adoption of UQ methodologies. In doing so, we highlight how these techniques can improve the precision and reliability of medical evaluations. We delve into the crucial role of understanding and managing the uncertainties present in clinical data (such as measurement error), diagnostic tools and treatment outcomes. We discuss how such uncertainties can impact decision-making in healthcare and emphasize the importance of systematically analysing them. Our goal is to demonstrate how effectively addressing and decoding uncertainties can significantly enhance the accuracy and robustness of clinical decisions, ultimately leading to better patient outcomes and more informed healthcare practices. This article is part of the theme issue 'Uncertainty quantification for healthcare and biological systems (Part 1)'.

39. Artificial intelligence-assisted endoscopic diagnosis system for diagnosing Helicobacter pylori infection: a multicenter study. PubMed

Yue Hu, Jianwei Xu, Liang Huang, et al.
BMC Med. 2025 Oct 8;23(1):540. doi: 10.1186/s12916-025-04379-2.
BACKGROUND: Deep learning algorithm-based artificial intelligence (AI) has significantly advanced the domain of endoscopic diagnosis; however, its utilization for detecting Helicobacter pylori (H. pylori) infections remains constrained. We aimed to develop and validate the AI diagnostic system (HOPE AI) for diagnosing H. pylori infection by analyzing extensive imaging data obtained from clinical endoscopies. METHODS: This multicenter diagnostic study was carried out across seven hospitals in China. Eligible patients were individuals aged 18 years or older who underwent upper gastrointestinal gastroendoscopy. The endoscopic images were randomly allocated (7:3) to the training and internal validation datasets for the development of HOPE AI, utilizing a multi-instance learning (MIL) framework and long short-term memory (LSTM) architectures, and the prospective external validation dataset for assessing its diagnostic efficacy. The performance of HOPE AI was also benchmarked against endoscopists. The diagnostic accuracy, sensitivity, specificity, and area under the curve of HOPE AI were assessed to detect H. pylori infection. RESULTS: A total of 308,887 endoscopic images and 197 videos from 6207 patients were utilized to develop and evaluate HOPE AI. Our AI system demonstrated outstanding performance, achieving an AUC of 0.932 (95% confidence interval (CI) 0.906-0.956) in the internal validation set, 0.903 (0.883-0.922) in the external temporal validation set, 0.923 (0.875-0.961) in the external temporal validation video set, and ranging from 0.855 (0.813-0.894) to 0.971 (0.955-0.985) across seven external geographical validation sets. The diagnostic sensitivity of HOPE AI (85.7%) significantly surpassed that of senior endoscopists (68.0%). CONCLUSIONS: HOPE AI exhibited robust diagnostic efficacy and interpretability in H. pylori detection, thereby enhancing the efficiency of diagnosis in routine screening contexts. TRIAL REGISTRATION: Chinese Clinical Trial Registry: ChiCTR 2400091317, 2400091720.

40. Ethical evaluation in acute stroke decision-making. PubMed

Michel Shamy, Brian Dewar, Mark Fedyk
J Eval Clin Pract. 2024 Aug;30(5):749-755. doi: 10.1111/jep.13927. Epub 2023 Oct 5.
RATIONALE: The evidentiary standards and epistemic models of clinical care, especially those of evidence-based medicine, are dissimilar to those used in philosophy and examination of how the two systems intersect may help clinicians make more informed treatment decisions. AIMS AND OBJECTIVES: This paper examines the use of ethical frameworks in routine clinical decision-making, using the example of acute stroke treatment decisions to demonstrate that ethical evaluation is integral to clinical practice. METHOD: Utilising acute stroke care as a lens through which to examine the phenomenon of ethical evaluation in medical practice, we offer a philosophical analysis of the presence of ethical evaluation in medicine. RESULTS AND CONCLUSION: We find that the medical establishment should embrace ethical evaluation as intrinsic to medical practice and that medical training and treatment guidelines should reflect this reality. Patients deserve clarity and transparency about how physicians make determinations about their treatment, and physicians should be prepared to offer explanations for those decisions.

41. SurGen: Text-Guided Diffusion Model for Surgical Video Generation. OpenAlex

Joseph Cho, Samuel Schmidgall, Cyril Zakka, et al.
Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis. SurGen produces videos with the highest resolution and longest duration among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.

42. Interactive Generation of Laparoscopic Videos with Diffusion Models. [OpenAlex]

Ivan Iliash, Simeon Allmendinger, Felix Meissen, et al.

43. Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model. [OpenAlex]

Wei Li, Ming Hu, Guoan Wang, et al.

44. Text-to-video generative artificial intelligence: Sora in neurosurgery. [OpenAlex]

Ali A. Mohamed, Brandon Lucke‐Wold

45. EndoViT: pretraining vision transformers on a large collection of endoscopic images. [OpenAlex]

Dominik Batić, Felix Holm, Ege Özsoy, et al.
PURPOSE: Automated endoscopy video analysis is essential for assisting surgeons during medical procedures, but it faces challenges due to complex surgical scenes and limited annotated data. Large-scale pretraining has shown great success in natural language processing and computer vision communities in recent years. These approaches reduce the need for annotated data, which is of great interest in the medical domain. In this work, we investigate endoscopy domain-specific self-supervised pretraining on large collections of data. METHODS: To this end, we first collect Endo700k, the largest publicly available corpus of endoscopic images, extracted from nine public Minimally Invasive Surgery (MIS) datasets. Endo700k comprises more than 700,000 images. Next, we introduce EndoViT, an endoscopy-pretrained Vision Transformer (ViT), and evaluate it on a diverse set of surgical downstream tasks. RESULTS: Our findings indicate that domain-specific pretraining with EndoViT yields notable advantages in complex downstream tasks. In the case of action triplet recognition, our approach outperforms ImageNet pretraining. In semantic segmentation, we surpass the state-of-the-art (SOTA) performance. These results demonstrate the effectiveness of our domain-specific pretraining approach in addressing the challenges of automated endoscopy video analysis. CONCLUSION: Our study contributes to the field of medical computer vision by showcasing the benefits of domain-specific large-scale self-supervised pretraining for vision transformers. We release both our code and pretrained models to facilitate further research in this direction: https://github.com/DominikBatic/EndoViT .

46. Deep learning approaches to surgical video segmentation and object detection: A scoping review. [PubMed]

Devanish N Kamtam, Joseph B Shrager, Satya Deepya Malla, et al.
Comput Biol Med. 2025 Aug;194:110482. doi: 10.1016/j.compbiomed.2025.110482. Epub 2025 Jun 2.
INTRODUCTION: Computer vision (CV) has had a transformative impact in biomedical fields such as radiology, dermatology, and pathology. Its real-world adoption in surgical applications, however, remains limited. We review the current state-of-the-art performance of deep learning (DL)-based CV models for segmentation and object detection of anatomical structures in videos obtained during surgical procedures. METHODS: We conducted a scoping review of studies on semantic segmentation and object detection of anatomical structures published between 2014 and 2024 from 3 major databases - PubMed, Embase, and IEEE Xplore. The primary objective was to evaluate the state-of-the-art performance of semantic segmentation in surgical videos. Secondary objectives included examining DL models, progress toward clinical applications, and the specific challenges with segmentation of organs/tissues in surgical videos. RESULTS: We identified 61 relevant published studies. These focused predominantly on procedures from general surgery [22(36.1 %)], colorectal surgery [9(14.7 %)], and neurosurgery [8(13.1 %)]. Cholecystectomy [16(26.2 %)] and low anterior rectal resection [5(8.2 %)] were the most common procedures addressed. Semantic segmentation [50(82 %)] was the primary CV task. U-Net [13(21.3 %)] and DeepLab [13(21.3 %)] were the most widely used models. Larger organs such as the liver (Dice score: 0.88) had higher accuracy compared to smaller structures such as nerves (Dice score: 0.49). Models demonstrated real-time inference potential ranging from 5 to 298 frames-per-second (fps). CONCLUSION: This review highlights the significant progress made in DL-based semantic segmentation for surgical videos with real-time applicability, particularly for larger organs. Addressing challenges with smaller structures, data availability, and generalizability remains crucial for future advancements.
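For readers unfamiliar with the Dice scores quoted in this abstract (0.88 for the liver, 0.49 for nerves), the metric is twice the overlap between the predicted and reference masks divided by the sum of their sizes. The following is a minimal sketch in plain Python; the function name `dice_score` and the toy masks are illustrative assumptions and do not come from the review.

```python
def dice_score(pred, truth):
    """Dice = 2|A and B| / (|A| + |B|) for two binary masks given as flat 0/1 lists."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 3 predicted pixels, 3 reference pixels, 2 of them shared.
pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
score = dice_score(pred, truth)  # 2 * 2 / (3 + 3), about 0.667
```

Because the denominator is the combined mask size, an error of a few pixels costs a small structure far more than a large one, which is one plausible reason thin structures such as nerves score lower than the liver.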

47. Iatrogenic Articular Cartilage Injury in Arthroscopic Hip and Knee Videos and the Potential for Cartilage Cell Death When Simulated in a Bovine Model. [PubMed]

Jocelyn Compton, Michael Slattery, Mitchell Coleman, et al.
Arthroscopy. 2020 Aug;36(8):2114-2121. doi: 10.1016/j.arthro.2020.02.017. Epub 2020 Mar 4.
PURPOSE: To determine the incidence and characterize the severity of iatrogenic cartilage injuries. METHODS: Technique videos of arthroscopic femoral acetabular impingement procedures and meniscus repairs on VuMedi (n = 85) and Arthroscopy Techniques (n = 45) were reviewed and iatrogenic cartilage injuries were identified and graded (minor, intermediate, and major injury) by 2 independent reviewers. To demonstrate that even minor injuries on a cellular scale result in damage, a bovine osteochondral explant was used to create comparable minor iatrogenic injuries at varied forces that do not disrupt the articular surface (1.5 N, 2.5 N, and 9.8 N). Dead chondrocytes at the site of injury were stained with ethidium homodimer-2 and imaged with an Olympus FV1000 confocal microscope. χ² tests were used for analysis; all results with P < .05 were considered significant. RESULTS: In total, 130 videos of arthroscopic meniscus and femoral acetabular impingement procedures were analyzed and the incidence of iatrogenic cartilage injury was 73.8%. There were 110 (70.0%) minor, 35 (22.3%) intermediate, and 11 (7.0%) major iatrogenic injuries. All forces tested in the minor injury bovine model resulted in chondrocyte death at the site of contact. CONCLUSIONS: Iatrogenic articular cartilage injuries are common in arthroscopy, occurring in more than 70% of the surgeon-published instructional videos analyzed. At least some chondrocyte death occurs with minor simulated iatrogenic injuries (1.5 N). CLINICAL RELEVANCE: The high rate of cartilage damage during arthroscopic technique videos likely under-represents the true incidence in clinical practice. Cell death occurs in the bovine minor injury model with minimal contact forces. This suggests iatrogenic cartilage damage during arthroscopy could contribute to clinical outcomes.

48. Use of artificial intelligence and deep learning in fetal ultrasound imaging. [PubMed]

R Ramirez Zegarra, T Ghi
Ultrasound Obstet Gynecol. 2023 Aug;62(2):185-194. doi: 10.1002/uog.26130. Epub 2023 Jul 10.
Deep learning is considered the leading artificial intelligence tool in image analysis in general. Deep-learning algorithms excel at image recognition, which makes them valuable in medical imaging. Obstetric ultrasound has become the gold standard imaging modality for detection and diagnosis of fetal malformations. However, ultrasound relies heavily on the operator's experience, making it unreliable in inexperienced hands. Several studies have proposed the use of deep-learning models as a tool to support sonographers, in an attempt to overcome these problems inherent to ultrasound. Deep learning has many clinical applications in the field of fetal imaging, including identification of normal and abnormal fetal anatomy and measurement of fetal biometry. In this Review, we provide a comprehensive explanation of the fundamentals of deep learning in fetal imaging, with particular focus on its clinical applicability. © 2022 International Society of Ultrasound in Obstetrics and Gynecology.

49. International multicenter validation of AI-driven ultrasound detection of ovarian cancer. [PubMed]

Filip Christiansen, Emir Konuk, Adithya Raju Ganeshan, et al.
Nat Med. 2025 Jan;31(1):189-196. doi: 10.1038/s41591-024-03329-4. Epub 2025 Jan 2.
Ovarian lesions are common and often incidentally detected. A critical shortage of expert ultrasound examiners has raised concerns of unnecessary interventions and delayed cancer diagnoses. Deep learning has shown promising results in the detection of ovarian cancer in ultrasound images; however, external validation is lacking. In this international multicenter retrospective study, we developed and validated transformer-based neural network models using a comprehensive dataset of 17,119 ultrasound images from 3,652 patients across 20 centers in eight countries. Using a leave-one-center-out cross-validation scheme, for each center in turn, we trained a model using data from the remaining centers. The models demonstrated robust performance across centers, ultrasound systems, histological diagnoses and patient age groups, significantly outperforming both expert and non-expert examiners on all evaluated metrics, namely F1 score, sensitivity, specificity, accuracy, Cohen's kappa, Matthew's correlation coefficient, diagnostic odds ratio and Youden's J statistic. Furthermore, in a retrospective triage simulation, artificial intelligence (AI)-driven diagnostic support reduced referrals to experts by 63% while significantly surpassing the diagnostic performance of the current practice. These results show that transformer-based models exhibit strong generalization and above human expert-level diagnostic accuracy, with the potential to alleviate the shortage of expert ultrasound examiners and improve patient outcomes.
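The leave-one-center-out scheme described in this abstract can be sketched as follows: each center is held out once as the test set while a model is trained on the data from all remaining centers. This is a minimal illustration in plain Python, not the authors' code; the function name `leave_one_center_out`, the center labels, and the record names are hypothetical.

```python
from collections import defaultdict


def leave_one_center_out(samples):
    """Yield (held_out_center, train_records, test_records) for every center.

    `samples` is a list of (center_id, record) pairs.
    """
    by_center = defaultdict(list)
    for center, record in samples:
        by_center[center].append(record)
    for held_out in sorted(by_center):
        # Train on every record that does not come from the held-out center.
        train = [r for c, recs in by_center.items() if c != held_out for r in recs]
        test = list(by_center[held_out])
        yield held_out, train, test


# Toy data: three hypothetical centers with four images in total.
samples = [("A", "img1"), ("A", "img2"), ("B", "img3"), ("C", "img4")]
splits = list(leave_one_center_out(samples))
```

Splitting by center rather than by image keeps every image from the held-out site out of training, so each fold estimates performance on a site the model has never seen.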

50. Using Animated Videos to Increase Patient Knowledge: A Meta-Analytic Review. [PubMed]

Thomas Hugh Feeley, Maria Keller, Liise Kayler
Health Educ Behav. 2023 Apr;50(2):240-249. doi: 10.1177/10901981221116791. Epub 2022 Aug 11.
This article meta-analyzed 21 studies that tested the effectiveness of animated videos in improving learning in clinical and nonclinical settings compared with standard education. Animation was defined as the use of moving objects that are typically drawn or simulated. Videos ranged from just over 2 min in duration to 16 min in duration in articles published from 2009 through 2020. Mayer's Cognitive Theory of Multimedia Learning provided the theoretical model to frame the current analyses. Findings indicated an overall positive effect (effect size = 0.35) for use of animation in improving viewers' learning across a variety of health and clinical contexts, including surgery and diabetes. Moderator analyses indicated learning effects were greater in patient samples and samples with a higher proportion of male participants. Study findings were discussed in terms of the theoretical and practical implications for health communication scholars and practitioners.

51. Live-Streaming Surgery for Medical Student Education - Educational Solutions in Neurosurgery During the COVID-19 Pandemic. [PubMed]

Megan M Jack, Domenico A Gattozzi, Paul J Camarata, et al.
J Surg Educ. 2021 Jan-Feb;78(1):99-103. doi: 10.1016/j.jsurg.2020.07.005. Epub 2020 Jul 31.
OBJECTIVE: The COVID-19 pandemic significantly altered medical student education. The ability for students to be a part of the operating room team was highly restricted. Technology can be used to ensure ongoing surgical education during this time of limited in-person educational opportunities. DESIGN: We have developed an innovative solution of securely live-streaming surgery with real-time communication between the surgeon and students to allow for ongoing education during the pandemic. RESULTS: We successfully live-streamed multiple different types of neurosurgical operations utilizing multiple video sources. This method uses inexpensive, universal equipment that can be implemented at any institution to enable virtual education of medical students and other learners. CONCLUSIONS: This technology has facilitated education during this challenging time. This technological set-up for live-streaming surgery has the potential of improving medical and graduate medical education in the future.

52. Surgical telementoring: Feasibility, applicability, and how to. [PubMed]

Rodrigo Gerardo, Prachi Lele, Krithika Sundaram, et al.
J Surg Oncol. 2021 Aug;124(2):241-245. doi: 10.1002/jso.26511.
Surgical training does not end at the conclusion of residency training. Expansions in medical technology and surgical technique have created a steep learning curve for the young attending surgeon. The emergence of intraoperative telementoring has allowed experienced surgeons to guide learners through complex surgical cases remotely with the assistance of streaming video technology. Here, we describe the basics of telementoring, financial and legal considerations, and recommend hardware specifications for optimal use.

53. Development, application and evaluation of an artificial intelligence (AI)-based platform (SurgSmart) for the automatic assessment of the critical view of safety (CVS) in laparoscopic cholecystectomy (LC). [PubMed]

Ming Tang, Ran Hu, Dian Qin, et al.
Surg Endosc. 2026 Mar 6. doi: 10.1007/s00464-026-12658-z.
BACKGROUND: Laparoscopic Cholecystectomy (LC) is the standard surgical treatment for symptomatic benign gallbladder diseases. Bile Duct Injury (BDI) is a common and serious complication of LC. Critical View of Safety (CVS) has been proven crucial in preventing BDI. However, the current attainment rate of CVS remains low. Recent advancements in AI offer an efficient way to close this gap. This study aims to develop an intelligent surgical platform (SurgSmart) enabling real-time assessment of the Critical View of Safety (CVS), integrate it into routine surgical practice, and evaluate its performance and user acceptance based on intraoperative and post-operative feedback. MATERIALS AND METHODS: A total of 377 LC videos from 17 hospitals were retrospectively collected for training, validation, and testing of the AI algorithm. The model's effectiveness was evaluated using accuracy, precision, recall, F1-score, and macro-average F1-score. Our platform was deployed in the operating rooms of three hospitals. From May to October 2024, we collected LC videos and surgical reports and assessed variations in CVS scores. Surgeons were surveyed to evaluate user satisfaction and surgical confidence and to gather suggestions. RESULTS: For CVS I, II, and III the overall accuracy and macro-average F1-score are 0.91 and 0.72, 0.86 and 0.67, 0.73 and 0.70, respectively. The overall CVS scores for the three hospitals showed significant improvement after the deployment of the platform (P < 0.01). Fifteen out of the eighteen surgeons who used our platform demonstrated overall improvement (P < 0.05). Surgeons' satisfaction was high, with recommendations including more adequate training and guidance as well as further improvements in model performance. CONCLUSION: This platform has demonstrated its feasibility for real-time and automated CVS assessment. Most surgeons improved after using our platform. Surgeons reported positive feedback and expressed hope for more adequate guidance and continuous improvements in model performance.

54. Mathematical 3D Liver Model for Surgical versus Ablative Therapy Treatment Planning for Colorectal Liver Metastases: Recommendations from the COLLISION and COLDFIRE Trial Expert Panels. [PubMed]

Bente A T van den Bemd, Robbert S Puijk, Han Keijzers, et al.
Radiol Imaging Cancer. 2024 Nov;6(6):e240068. doi: 10.1148/rycan.240068.
PURPOSE: To further define anatomic criteria for resection and ablation using an expert panel-based three-dimensional liver model to objectively predict local treatment recommendations for colorectal liver metastases (CRLM). MATERIALS AND METHODS: This study analyzed data from participants with small CRLM (≤3 cm) considered suitable for resection, thermal ablation, or irreversible electroporation (IRE), according to a multidisciplinary expert panel, who were included in two prospective multicenter trials (COLLISION [NCT03088150] and COLDFIRE-2 [NCT02082782]) between August 2017 and June 2022. Ten randomly selected participants were used to standardize the model's Couinaud segments. CRLM coordinates were measured and plotted in the model as color-coded lesions according to the treatment recommendations. Statistical validation was achieved through leave-one-out cross-validation. RESULTS: A total of 611 CRLM in 202 participants (mean age, 63 [range, 29-87] years; 138 male and 64 female) were included. Superficially located CRLM were considered suitable for resection, whereas more deep-seated CRLM were preferably ablated, with the transition zone at a subsurface depth of 3 cm. Ninety-three percent (25 of 27) of perihilar CRLM treated with IRE were at least partially located within 1 cm from the portal triad. Use of the model correctly predicted the preferred treatment in 313 of 424 CRLM (73.8%). CONCLUSION: The results suggest that CRLM can be defined as superficial (preferably resected) and deep-seated (preferably ablated) if the tumor center is within versus beyond 3 cm from the liver surface, respectively, and as perihilar if the tumor margins extend to within 1 cm from the portal triad. KEYWORDS: Ablation Techniques, CT, MRI, Liver, Abdomen/GI, Metastases, Oncology. © RSNA, 2024.

55. The balance between artificial and human intelligence in clinical practice. [PubMed]

Domenico Marrella, Turkka Anttila, Jorma Ryhänen, et al.
J Hand Surg Eur Vol. 2026 Jan 15:17531934251401382. doi: 10.1177/17531934251401382.
INTRODUCTION: Artificial intelligence (AI) is becoming increasingly integrated into clinical care in hand surgery. Its applications extend across diagnosis, planning, intraoperative assistance, postoperative monitoring, rehabilitation, prosthetics and education. APPLICATIONS: In diagnostic imaging, AI improves the detection of distal radius and scaphoid fractures, estimates osteoporosis from hand radiographs, identifies triangular fibrocartilage complex injuries on magnetic resonance imaging, segments bones and cartilage, and supports dynamic wrist analysis; ultrasound- and neurophysiological-based models aid carpal tunnel syndrome diagnosis. Prognostic models predict outcomes after carpal tunnel release and thumb carpometacarpal osteoarthritis with mixed performance. Pre- and intraoperative applications include large language model-based triage and coding, navigation and phase/gesture recognition from surgical video, autonomous microsurgical prototypes and telemanipulator platforms for supermicrosurgery. Artificial intelligence-enabled telemonitoring (e.g. remote photoplethysmography) and video-based mobility tracking support postoperative care and rehabilitation. Vision-guided and multimodal sensing enhance myoelectric prosthesis control. RISKS: Risks include data privacy and security, algorithmic bias (data, transposition, normative, annotation) and opacity, overreliance with automation bias and skill erosion, and unresolved legal and ethical questions (liability, conflicts of interest, compassion in care). CONCLUSION: Balanced adoption requires diversified datasets, privacy-preserving strategies (pseudonymization, differential privacy, federated learning), transparent reporting, AI literacy and ethics in medical education and interfaces that expose uncertainty and employ cognitive forcing functions. Post-deployment surveillance should track data drift, out-of-distribution inputs and performance using automated alerts and multidisciplinary review. 
Artificial intelligence should augment, never replace, clinical judgment, with explicit role delineation and continuous monitoring to safeguard equity and patient-centred outcomes.

56. Endoscopic assessment of the oesophageal features of eosinophilic oesophagitis: validation of a novel classification and grading system. [PubMed]

Ikuo Hirano, Nelson Moy, Michael G Heckman, et al.
Gut. 2013 Apr;62(4):489-95. doi: 10.1136/gutjnl-2011-301817. Epub 2012 May 22.
OBJECTIVE: Abnormalities are commonly identified during endoscopy in eosinophilic oesophagitis (EoE). There is no standardised classification to describe these features. This study aimed to evaluate the interobserver agreement of a grading system for the oesophageal features of EoE. METHOD: The proposed system incorporated the grading of four major oesophageal features (rings, furrows, exudates, oedema) and the presence of additional features of narrow calibre oesophagus, feline oesophagus, stricture and crepe paper oesophagus. Endoscopic videos from 25 patients with EoE and controls were reviewed by 21 gastroenterologists. Interobserver agreement was assessed by estimating multi-rater κ and the proportion of pairwise agreement. RESULTS: Using the original grading system, agreement for rings, furrows and exudates was moderate (κ=0.38-0.46, 56-65% agreement) but poor for oedema (κ=0.23, 51% agreement). Identification of narrow calibre oesophagus had fair agreement (κ=0.30, 74% agreement) while feline oesophagus had poor agreement (κ=0.15, 68% agreement). After collapsing the severity grading for oedema and furrows and eliminating poorly performing features of feline oesophagus and narrow calibre oesophagus, a modified grading system demonstrated good agreement for the four major features of EoE (κ=0.40-0.54, 71-81% agreement) and additional features of stricture and crepe paper oesophagus (κ=0.52 and 0.58, 79% and 92% agreement). CONCLUSIONS: The proposed system for endoscopically-identified oesophageal features of EoE defines common nomenclature and severity scores for the assessment of EoE disease activity. The system has good interobserver agreement among practising and academic gastroenterologists.

57. An artificial intelligence-enhanced coaching mode. [PubMed]

Ke Cheng, Shangdi Wu, Bing Peng, et al.
Int J Surg. 2025 Sep 1;111(9):6469-6472. doi: 10.1097/JS9.0000000000002713. Epub 2025 Jun 23.
Surgical coaching has emerged as an innovative educational strategy designed to enhance both the technical and nontechnical competencies of surgeons through structured, individualized feedback. As minimally invasive surgical techniques continue to proliferate, video-based coaching has proven effective for skill refinement. However, its broader implementation remains limited due to a shortage of expert coaches and the labor-intensive nature of video review. Advances in artificial intelligence (AI), particularly in the field of computer vision (CV), present promising opportunities to optimize surgical coaching by automating video analysis and enabling scalable, data-driven feedback mechanisms. This study introduces SmartCoach, an AI-assisted surgical coaching program designed to support laparoscopic pancreatoduodenectomy - a technically demanding procedure typically reserved for highly experienced surgeons. The program integrates an intelligent visualization system and structured postoperative debriefings to identify key performance issues and foster targeted improvement strategies. Preliminary survey data revealed limited awareness among participating surgeons regarding surgical coaching principles and the role of AI in surgical education. While most reported frequent use of operative videos for learning, they cited the lack of expert feedback and inefficiency as major barriers. The AI-driven coaching model seeks to address these challenges by providing real-time intraoperative assessments, automated identification of surgical steps, and enhanced scalability facilitated by 5G-enabled communication technologies. Despite its promise, the implementation of AI-based coaching faces ethical, logistical, and cultural obstacles, including data privacy concerns and resistance to change among experienced surgeons. 
Nonetheless, the integration of AI into surgical coaching represents a transformative step toward improving operative performance, surgeon well-being, and patient outcomes, particularly in highly complex procedures where expert support is often limited.

58. HIPAA and video recordings in the clinical setting. [PubMed]

Gayla Miles, Anne Quinlan
Nursing. 2023 Jan 1;53(1):15-19. doi: 10.1097/01.NURSE.0000902940.51519.50.
The advent of cellular network technology has increased the use of photography in the clinical setting. This article reviews several areas regarding protected health information (PHI) and the use of video: the 1996 Health Insurance Portability and Accountability Act (HIPAA); The Joint Commission requirements for the use of images; areas of concern for exchanging PHI with law enforcement at the bedside, and the need for the development of formal guidelines regarding the use of video in the clinical setting.

59. Evaluation of Federated Learning Using Standardized EHR Data in Japan. [PubMed]

Koutarou Matsumoto, Saori Tou, Yuta Nakamura, et al.
Stud Health Technol Inform. 2025 Aug 7;329:1034-1038. doi: 10.3233/SHTI250996.
This study addresses privacy concerns in multi-institutional data sharing by applying federated learning (FL) to develop a predictive model for prolonged air leaks (PAL) following video-assisted thoracoscopic surgery (VATS). Utilizing standardized electronic health record (EHR) data from two hospitals in Japan, we maintained a high discriminatory accuracy by exchanging only model parameters without sharing the underlying data, thereby ensuring patient privacy. These results suggest that FL can improve the accuracy of predictive models in healthcare while protecting data privacy. However, compared with models developed using centralized data, models built with FL tended to be more influenced by facilities with larger case volumes, indicating the need for further validation.
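The idea of exchanging only model parameters can be illustrated with a size-weighted parameter average in the style of FedAvg. The abstract does not state which aggregation rule the study used, so this is an assumed technique for illustration only; the function name `federated_average`, the parameter values, and the case counts are all hypothetical.

```python
def federated_average(site_params, site_sizes):
    """Combine per-site parameter vectors into one global vector.

    Each site contributes in proportion to its number of cases (FedAvg-style);
    only the parameter lists are shared, never the underlying patient records.
    """
    if len(site_params) != len(site_sizes):
        raise ValueError("need one size per site")
    total = sum(site_sizes)
    n_params = len(site_params[0])
    return [
        sum(params[i] * size for params, size in zip(site_params, site_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical hospitals with very different case volumes.
hospital_a = [0.2, 1.0]  # parameters fitted locally on 900 cases
hospital_b = [0.6, 0.0]  # parameters fitted locally on 100 cases
global_params = federated_average([hospital_a, hospital_b], [900, 100])
# approximately [0.24, 0.9], much closer to hospital_a than to hospital_b
```

Under this kind of weighting the global model sits close to the larger site's parameters, which is consistent with the abstract's remark that federated models tended to be more influenced by facilities with larger case volumes.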

60. Sample sizes and statistical methods in interventional studies on individuals with spinal cord injury: A systematic review. [PubMed]

Georg Zimmermann, Lisa-Maria Bolter, Ronny Sluka, et al.
J Evid Based Med. 2019 Aug;12(3):200-208. doi: 10.1111/jebm.12356. Epub 2019 Jun 23.
AIM: Prevalence and incidence of spinal cord injury (SCI) are low. However, sample sizes have not been systematically examined yet, although this might represent useful information for study planning and power considerations. Therefore, our objective was to determine the median sample size in clinical trials on SCI individuals. Moreover, within small-sample size studies, statistical methods and awareness of potential problems regarding small samples were examined. METHODS: We systematically reviewed all studies on human SCI individuals published between 2014 and 2015, where the effect of an intervention on one or more health-related outcomes was assessed by means of a hypothesis test. If at least one group had a size <20, the study was classified as a small sample size study. PubMed was searched for eligible studies; subsequently, data on sample sizes and statistical methods were extracted and summarized descriptively. RESULTS: Out of 8897 studies, 207 were included. Median total sample size was 18 (range 4-582). Small sample sizes were found in 167/207 (81%) studies; the resulting limitations and implications for statistical analyses were mentioned in 109/167 (65%) of these. CONCLUSIONS: Although most recent SCI trials have been conducted with small samples, the consequences on statistical analysis methods and the validity of the results are rarely acknowledged.

61. Controllable illumination invariant GAN for diverse temporally-consistent surgical video synthesis. [PubMed]

Long Chen, Mobarak I Hoque, Zhe Min, et al.
Med Image Anal. 2025 Oct;105:103731. doi: 10.1016/j.media.2025.103731. Epub 2025 Jul 25.
Surgical video synthesis offers a cost-effective way to expand training data and enhance the performance of machine learning models in computer-assisted surgery. However, existing video translation methods often produce video sequences with large illumination changes across different views, disrupting the temporal consistency of the videos. Additionally, these methods typically synthesize videos with a monotonous style, whereas diverse synthetic data is desired to improve the generalization ability of downstream machine learning models. To address these challenges, we propose a novel Controllable Illumination Invariant Generative Adversarial Network (CIIGAN) for generating diverse, illumination-consistent video sequences. CIIGAN fuses multi-scale illumination-invariant features from a novel controllable illumination-invariant (CII) image space with multi-scale texture-invariant features from self-constructed 3D scenes. The CII image space, along with the 3D scenes, allows CIIGAN to produce diverse and temporally-consistent video or image translations. Extensive experiments demonstrate that CIIGAN achieves more realistic and illumination-consistent translations compared to previous state-of-the-art baselines. Furthermore, the segmentation networks trained on our diverse synthetic data outperform those trained on monotonous synthetic data. Our source code, well-trained models, and 3D simulation scenes are publicly available at https://github.com/LongChenCV/CIIGAN.

62. High Efficiency Video Coding (HEVC)-Based Surgical Telementoring System Using Shallow Convolutional Neural Network. [PubMed]

Ali Hassan, Mubeen Ghafoor, Syed Ali Tariq, et al.
J Digit Imaging. 2019 Dec;32(6):1027-1043. doi: 10.1007/s10278-019-00206-2.
Surgical telementoring systems have gained lots of interest, especially in remote locations. However, bandwidth constraint has been the primary bottleneck for efficient telementoring systems. This study aims to establish an efficient surgical telementoring system, where the qualified surgeon (mentor) provides real-time guidance and technical assistance for surgical procedures to the on-spot physician (surgeon). High Efficiency Video Coding (HEVC/H.265)-based video compression has shown promising results for telementoring applications. However, there is a trade-off between the bandwidth resources required for video transmission and quality of video received by the remote surgeon. In order to efficiently compress and transmit real-time surgical videos, a hybrid lossless-lossy approach is proposed where surgical incision region is coded in high quality whereas the background region is coded in low quality based on distance from the surgical incision region. For surgical incision region extraction, state-of-the-art deep learning (DL) architectures for semantic segmentation can be used. However, the computational complexity of these architectures is high resulting in large training and inference times. For telementoring systems, encoding time is crucial; therefore, very deep architectures are not suitable for surgical incision extraction. In this study, we propose a shallow convolutional neural network (S-CNN)-based segmentation approach that consists of encoder network only for surgical region extraction. The segmentation performance of S-CNN is compared with one of the state-of-the-art image segmentation networks (SegNet), and results demonstrate the effectiveness of the proposed network. The proposed telementoring system is efficient and explicitly considers the physiological nature of the human visual system to encode the video by providing good overall visual impact in the location of surgery. 
The results of the proposed S-CNN-based segmentation demonstrated a pixel accuracy of 97% and a mean intersection over union accuracy of 79%. Similarly, HEVC experimental results showed that the proposed surgical region-based encoding scheme achieved an average bitrate reduction of 88.8% at high-quality settings in comparison with default full-frame HEVC encoding. The average gain in encoding performance (signal-to-noise) of the proposed algorithm is 11.5 dB in the surgical region. The bitrate saving and visual quality of the proposed optimal bit allocation scheme are compared with the mean shift segmentation-based coding scheme for fair comparison. The results show that the proposed scheme maintains high visual quality in surgical incision region along with achieving good bitrate saving. Based on comparison and results, the proposed encoding algorithm can be considered as an efficient and effective solution for surgical telementoring systems for low-bandwidth networks.

63. Application of artificial intelligence in gastroenterology. PubMed

Young Joo Yang, Chang Seok Bang
World J Gastroenterol. 2019 Apr 14;25(14):1666-1683. doi: 10.3748/wjg.v25.i14.1666.
Artificial intelligence (AI) using deep-learning (DL) has emerged as a breakthrough computer technology. By the era of big data, the accumulation of an enormous number of digital images and medical records drove the need for the utilization of AI to efficiently deal with these data, which have become fundamental resources for a machine to learn by itself. Among several DL models, the convolutional neural network showed outstanding performance in image analysis. In the field of gastroenterology, physicians handle large amounts of clinical data and various kinds of image devices such as endoscopy and ultrasound. AI has been applied in gastroenterology in terms of diagnosis, prognosis, and image analysis. However, potential inherent selection bias cannot be excluded in the form of retrospective study. Because overfitting and spectrum bias (class imbalance) have the possibility of overestimating the accuracy, external validation using unused datasets for model development, collected in a way that minimizes the spectrum bias, is mandatory. For robust verification, prospective studies with adequate inclusion/exclusion criteria, which represent the target populations, are needed. DL has its own lack of interpretability. Because interpretability is important in that it can provide safety measures, help to detect bias, and create social acceptance, further investigations should be performed.

64. Interpretable machine learning model for predicting post-hepatectomy liver failure in hepatocellular carcinoma. PubMed

Tianzhi Tang, Tianyu Guo, Bo Zhu, et al.
Sci Rep. 2025 May 3;15(1):15469. doi: 10.1038/s41598-025-97878-4.
Post-hepatectomy liver failure (PHLF) is a severe complication following liver surgery. We aimed to develop a novel, interpretable machine learning (ML) model to predict PHLF. We enrolled 312 hepatocellular carcinoma (HCC) patients who underwent hepatectomy, and 30% of the samples were utilized for internal validation. Variable selection was performed using the least absolute shrinkage and selection operator regression in conjunction with random forest and recursive feature elimination (RF-RFE) algorithms. Subsequently, 12 distinct ML algorithms were employed to identify the optimal prediction model. The area under the receiver operating characteristic curve, calibration curves, and decision curve analysis (DCA) were utilized to assess the model's predictive accuracy. Additionally, an independent prospective validation was conducted with 62 patients. The SHapley Additive exPlanations (SHAP) analysis further explained the extreme gradient boosting (XGBoost) model. The XGBoost model exhibited the highest accuracy with AUCs of 0.983 and 0.981 in the training and validation cohorts among 12 ML models. Calibration curves and DCA confirmed the model's accuracy and clinical applicability. Compared with traditional models, the XGBoost model had a higher AUC. The prospective cohort (AUC = 0.942) further confirmed the generalization ability of the XGBoost model. SHAP identified the top three critical variables: total bilirubin (TBIL), MELD score, and ICG-R15. Moreover, the SHAP summary plot was used to illustrate the positive or negative effects of the features as influenced by XGBoost. The XGBoost model provides a good preoperative prediction of PHLF in patients with resectable HCC.

65. Machine Learning Prediction for Spinal Deformity Surgery Blood Transfusion. PubMed

Meijia Luo, Xiaotian Lei, Zhendong Ding, et al.
World Neurosurg. 2025 Nov;203:124468. doi: 10.1016/j.wneu.2025.124468. Epub 2025 Sep 16.
BACKGROUND: Spinal deformity surgery (SDS) is usually accompanied by significant intraoperative blood loss and transfusion, which is not without risk, as transfusions can lead to transfusion reactions, transmission of infections, and immunosuppression. Therefore, limiting unnecessary intraoperative blood transfusion (IBT) by accurately predicting transfusion requirements is an important goal. METHODS: Patients with spinal deformities who received SDS at 11 large medical centers in China from 2012 to 2022 were included. A total of 162 cases were randomized into a training cohort (70%) and a testing cohort (30%) with an outcome of IBT. A total of 39 candidate factors were collected, including basic personal data, medical comorbidities, surgery-related indicators, and preoperative blood draw indicators, among others. Lasso regression was used to screen potential modeling features. The ten ML algorithms incorporated were logistic regression, decision tree, elastic network, k-nearest neighbor, neural networks, Light Gradient Boosting Machine, random forest (RF), eXtreme Gradient Boosting, support vector machine, and stacking ensemble model. The performance of these models was evaluated using receiver operating characteristic (ROC) curve, precision-recall, calibration, and decision curve analysis. In addition, SHapley Additive exPlanation was applied to interpret the predictive models. Finally, a web calculator and logistic analysis were created to quantify the hazard level of the features. RESULTS: By comparing the training group, validation group, and multiple parameter comparisons, the RF model had the strongest performance generalization ability (area under the curve [AUC] of ROC: 0.8716; AUC of precision recall: 0.8246; Brier score of calibration curve: 0.142). Seven key variables were determined including age, body mass index, preoperative hematocrit, fibrinogen, prefunction, bone graft, and number of levels fusion. 
Finally, logistics determined that level 4 vertebral fusion surgery may have the greatest IBT risk (odds ratio = 20.78, 95% confidence interval 3.9-110.83; P < 0.001). A web calculator has also been established for clinical personnel to assess the risk of IBT. CONCLUSIONS: In this study, multiple ML algorithms were successfully established to predict the risk of IBT in SDS, thereby making reasonable use of blood resources and optimizing blood transfusion strategies.

66. Ethical and regulatory challenges of large language models in medicine. PubMed

Jasmine Chiat Ling Ong, Shelley Yin-Hsi Chang, Wasswa William, et al.
Lancet Digit Health. 2024 Jun;6(6):e428-e432. doi: 10.1016/S2589-7500(24)00061-X. Epub 2024 Apr 23.
With the rapid growth of interest in and use of large language models (LLMs) across various industries, we are facing some crucial and profound ethical concerns, especially in the medical field. The unique technical architecture and purported emergent abilities of LLMs differentiate them substantially from other artificial intelligence (AI) models and natural language processing techniques used, necessitating a nuanced understanding of LLM ethics. In this Viewpoint, we highlight ethical concerns stemming from the perspectives of users, developers, and regulators, notably focusing on data privacy and rights of use, data provenance, intellectual property contamination, and broad applications and plasticity of LLMs. A comprehensive framework and mitigating strategies will be imperative for the responsible integration of LLMs into medical practice, ensuring alignment with ethical principles and safeguarding against potential societal risks.

67. Ethical Considerations of Using ChatGPT in Health Care. PubMed

Changyu Wang, Siru Liu, Hao Yang, et al.
J Med Internet Res. 2023 Aug 11;25:e48009. doi: 10.2196/48009.
ChatGPT has promising applications in health care, but potential ethical issues need to be addressed proactively to prevent harm. ChatGPT presents potential ethical challenges from legal, humanistic, algorithmic, and informational perspectives. Legal ethics concerns arise from the unclear allocation of responsibility when patient harm occurs and from potential breaches of patient privacy due to data collection. Clear rules and legal boundaries are needed to properly allocate liability and protect users. Humanistic ethics concerns arise from the potential disruption of the physician-patient relationship, humanistic care, and issues of integrity. Overreliance on artificial intelligence (AI) can undermine compassion and erode trust. Transparency and disclosure of AI-generated content are critical to maintaining integrity. Algorithmic ethics raise concerns about algorithmic bias, responsibility, transparency and explainability, as well as validation and evaluation. Information ethics include data bias, validity, and effectiveness. Biased training data can lead to biased output, and overreliance on ChatGPT can reduce patient adherence and encourage self-diagnosis. Ensuring the accuracy, reliability, and validity of ChatGPT-generated content requires rigorous validation and ongoing updates based on clinical practice. To navigate the evolving ethical landscape of AI, AI in health care must adhere to the strictest ethical standards. Through comprehensive ethical guidelines, health care professionals can ensure the responsible use of ChatGPT, promote accurate and reliable information exchange, protect patient privacy, and empower patients to make informed decisions about their health care.

68. Ethical considerations for artificial intelligence in dermatology: a scoping review. PubMed

Emily R Gordon, Megan H Trager, Despina Kontos, et al.
Br J Dermatol. 2024 May 17;190(6):789-797. doi: 10.1093/bjd/ljae040.
The field of dermatology is experiencing the rapid deployment of artificial intelligence (AI), from mobile applications (apps) for skin cancer detection to large language models like ChatGPT that can answer generalist or specialist questions about skin diagnoses. With these new applications, ethical concerns have emerged. In this scoping review, we aimed to identify the applications of AI to the field of dermatology and to understand their ethical implications. We used a multifaceted search approach, searching PubMed, MEDLINE, Cochrane Library and Google Scholar for primary literature, following the PRISMA Extension for Scoping Reviews guidance. Our advanced query included terms related to dermatology, AI and ethical considerations. Our search yielded 202 papers. After initial screening, 68 studies were included. Thirty-two were related to clinical image analysis and raised ethical concerns for misdiagnosis, data security, privacy violations and replacement of dermatologist jobs. Seventeen discussed limited skin of colour representation in datasets leading to potential misdiagnosis in the general population. Nine articles about teledermatology raised ethical concerns, including the exacerbation of health disparities, lack of standardized regulations, informed consent for AI use and privacy challenges. Seven addressed inaccuracies in the responses of large language models. Seven examined attitudes toward and trust in AI, with most patients requesting supplemental assessment by a physician to ensure reliability and accountability. Benefits of AI integration into clinical practice include increased patient access, improved clinical decision-making, efficiency and many others. However, safeguards must be put in place to ensure the ethical application of AI.

69. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. PubMed

Malik Sallam
Healthcare (Basel). 2023 Mar 19;11(6):887. doi: 10.3390/healthcare11060887.
ChatGPT is an artificial intelligence (AI)-based conversational large language model (LLM). The potential applications of LLMs in health care education, research, and practice could be promising if the associated valid concerns are proactively examined and addressed. The current systematic review aimed to investigate the utility of ChatGPT in health care education, research, and practice and to highlight its potential limitations. Using the PRISMA guidelines, a systematic search was conducted to retrieve English records in PubMed/MEDLINE and Google Scholar (published research or preprints) that examined ChatGPT in the context of health care education, research, or practice. A total of 60 records were eligible for inclusion. Benefits of ChatGPT were cited in 51/60 (85.0%) records and included: (1) improved scientific writing and enhancing research equity and versatility; (2) utility in health care research (efficient analysis of datasets, code generation, literature reviews, saving time to focus on experimental design, and drug discovery and development); (3) benefits in health care practice (streamlining the workflow, cost saving, documentation, personalized medicine, and improved health literacy); and (4) benefits in health care education including improved personalized learning and the focus on critical thinking and problem-based learning. Concerns regarding ChatGPT use were stated in 58/60 (96.7%) records including ethical, copyright, transparency, and legal issues, the risk of bias, plagiarism, lack of originality, inaccurate content with risk of hallucination, limited knowledge, incorrect citations, cybersecurity issues, and risk of infodemics. The promising applications of ChatGPT can induce paradigm shifts in health care education, research, and practice. However, the embrace of this AI chatbot should be conducted with extreme caution considering its potential limitations. 
As it currently stands, ChatGPT does not qualify to be listed as an author in scientific articles unless the ICMJE/COPE guidelines are revised or amended. An initiative involving all stakeholders in health care education, research, and practice is urgently needed. This will help to set a code of ethics to guide the responsible use of ChatGPT among other LLMs in health care and academia.

70. Ethics of Artificial Intelligence in Medicine. PubMed

Min Kyu Park, Neil Ashwood, Neil Capes
Cureus. 2025 May 6;17(5):e83567. doi: 10.7759/cureus.83567. eCollection 2025 May.
Artificial intelligence (AI) has shown great promise in becoming an integral part of healthcare, offering advancements in diagnostic accuracy, surgical precision, and personalised patient care in numerous medical specialties, including radiology and surgery. This paper explores the ethical implications of AI in medicine, with emphasis on the four key ethical principles of autonomy, beneficence, non-maleficence, and justice. The ethical challenges include concerns about patient consent, data privacy, clinical transparency, and the potential for AI to exacerbate health disparities. This paper explores the need for clear ethical guidelines and regulatory frameworks to ensure AI is used in a way that enhances healthcare without compromising ethical standards. As technology continues to advance, it is crucial to balance technological advancement with the fundamental principles of medical ethics to ensure that healthcare is delivered in a safe and compassionate manner.

71. Privacy-preserving artificial intelligence in healthcare: Techniques and applications. PubMed

Nazish Khalid, Adnan Qayyum, Muhammad Bilal, et al.
Comput Biol Med. 2023 May;158:106848. doi: 10.1016/j.compbiomed.2023.106848. Epub 2023 Apr 5.
There has been an increasing interest in translating artificial intelligence (AI) research into clinically-validated applications to improve the performance, capacity, and efficacy of healthcare services. Despite substantial research worldwide, very few AI-based applications have successfully made it to clinics. Key barriers to the widespread adoption of clinically validated AI applications include non-standardized medical records, limited availability of curated datasets, and stringent legal/ethical requirements to preserve patients' privacy. Therefore, there is a pressing need to improvise new data-sharing methods in the age of AI that preserve patient privacy while developing AI-based healthcare applications. In the literature, significant attention has been devoted to developing privacy-preserving techniques and overcoming the issues hampering AI adoption in an actual clinical environment. To this end, this study summarizes the state-of-the-art approaches for preserving privacy in AI-based healthcare applications. Prominent privacy-preserving techniques such as Federated Learning and Hybrid Techniques are elaborated along with potential privacy attacks, security challenges, and future directions.

72. Privacy-Preserving Artificial Intelligence Techniques in Biomedicine. PubMed

Reihaneh Torkzadehmahani, Reza Nasirigerdeh, David B Blumenthal, et al.
Methods Inf Med. 2022 Jun;61(S 01):e12-e27. doi: 10.1055/s-0041-1740630. Epub 2022 Jan 21.
BACKGROUND: Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g., in the interpretation of next-generation sequencing data and in the design of clinical decision support systems. OBJECTIVES: However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary statistics of a genome-wide association study can be used to determine the presence or absence of an individual in a given dataset. This considerable privacy risk has led to restrictions in accessing genomic and other biomedical data, which is detrimental for collaborative research and impedes scientific progress. Hence, there has been a substantial effort to develop AI methods that can learn from sensitive data while protecting individuals' privacy. METHOD: This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems. CONCLUSION: As the most promising direction, we suggest combining federated machine learning as a more scalable approach with other additional privacy-preserving techniques. This would allow to merge the advantages to provide privacy guarantees in a distributed way for biomedical applications. Nonetheless, more research is necessary as hybrid approaches pose new challenges such as additional network or computation overhead.

73. Privacy in the age of medical big data. PubMed

W Nicholson Price, I Glenn Cohen
Nat Med. 2019 Jan;25(1):37-43. doi: 10.1038/s41591-018-0272-7. Epub 2019 Jan 7.
Big data has become the ubiquitous watch word of medical innovation. The rapid development of machine-learning techniques and artificial intelligence in particular has promised to revolutionize medical practice from the allocation of resources to the diagnosis of complex diseases. But with big data comes big risks and challenges, among them significant questions about patient privacy. Here, we outline the legal and ethical challenges big data brings to patient privacy. We discuss, among other topics, how best to conceive of health privacy; the importance of equity, consent, and patient governance in data collection; discrimination in data uses; and how to handle data breaches. We close by sketching possible ways forward for the regulatory system.

74. Applications of Federated Learning in Mobile Health: Scoping Review. PubMed

Tongnian Wang, Yan Du, Yanmin Gong, et al.
J Med Internet Res. 2023 May 1;25:e43006. doi: 10.2196/43006.
BACKGROUND: The proliferation of mobile health (mHealth) applications is partly driven by the advancements in sensing and communication technologies, as well as the integration of artificial intelligence techniques. Data collected from mHealth applications, for example, on sensor devices carried by patients, can be mined and analyzed using artificial intelligence-based solutions to facilitate remote and (near) real-time decision-making in health care settings. However, such data often sit in data silos, and patients are often concerned about the privacy implications of sharing their raw data. Federated learning (FL) is a potential solution, as it allows multiple data owners to collaboratively train a machine learning model without requiring access to each other's raw data. OBJECTIVE: The goal of this scoping review is to gain an understanding of FL and its potential in dealing with sensitive and heterogeneous data in mHealth applications. Through this review, various stakeholders, such as health care providers, practitioners, and policy makers, can gain insight into the limitations and challenges associated with using FL in mHealth and make informed decisions when considering implementing FL-based solutions. METHODS: We conducted a scoping review following the guidelines of PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews). We searched 7 commonly used databases. The included studies were analyzed and summarized to identify the possible real-world applications and associated challenges of using FL in mHealth settings. RESULTS: A total of 1095 articles were retrieved during the database search, and 26 articles that met the inclusion criteria were included in the review. The analysis of these articles revealed 2 main application areas for FL in mHealth, that is, remote monitoring and diagnostic and treatment support. 
More specifically, FL was found to be commonly used for monitoring self-care ability, health status, and disease progression, as well as in diagnosis and treatment support of diseases. The review also identified several challenges (eg, expensive communication, statistical heterogeneity, and system heterogeneity) and potential solutions (eg, compression schemes, model personalization, and active sampling). CONCLUSIONS: This scoping review has highlighted the potential of FL as a privacy-preserving approach in mHealth applications and identified the technical limitations associated with its use. The challenges and opportunities outlined in this review can inform the research agenda for future studies in this field, to overcome these limitations and further advance the use of FL in mHealth.

75. Federated AI, Current State, and Future Potential. PubMed

Phoebe Clark, Eric K Oermann, Dinah Chen, et al.
Asia Pac J Ophthalmol (Phila). 2023;12(3):310-314. doi: 10.1097/APO.0000000000000614. Epub 2023 May 31.
Artificial intelligence and machine learning applications are becoming increasingly popular in health care and medical devices. The development of accurate machine learning algorithms requires large quantities of good and diverse data. This poses a challenge in health care because of the sensitive nature of sharing patient data. Decentralized algorithms through federated learning avoid data aggregation. In this paper we give an overview of federated learning, current examples in healthcare and ophthalmology, challenges, and next steps.

76. Methods and Impact for Using Federated Learning to Collaborate on Clinical Research. PubMed

Alexander T M Cheung, Mustafa Nasir-Moin, Young Joon Fred Kwon, et al.
Neurosurgery. 2023 Feb 1;92(2):431-438. doi: 10.1227/neu.0000000000002198. Epub 2022 Nov 8.
BACKGROUND: The development of accurate machine learning algorithms requires sufficient quantities of diverse data. This poses a challenge in health care because of the sensitive and siloed nature of biomedical information. Decentralized algorithms through federated learning (FL) avoid data aggregation by instead distributing algorithms to the data before centrally updating one global model. OBJECTIVE: To establish a multicenter collaboration and assess the feasibility of using FL to train machine learning models for intracranial hemorrhage (ICH) detection without sharing data between sites. METHODS: Five neurosurgery departments across the United States collaborated to establish a federated network and train a convolutional neural network to detect ICH on computed tomography scans. The global FL model was benchmarked against a standard, centrally trained model using a held-out data set and was compared against locally trained models using site data. RESULTS: A federated network of practicing neurosurgeon scientists was successfully initiated to train a model for predicting ICH. The FL model achieved an area under the ROC curve of 0.9487 (95% CI 0.9471-0.9503) when predicting all subtypes of ICH compared with a benchmark (non-FL) area under the ROC curve of 0.9753 (95% CI 0.9742-0.9764), although performance varied by subtype. The FL model consistently achieved top three performance when validated on any site's data, suggesting improved generalizability. A qualitative survey described the experience of participants in the federated network. CONCLUSION: This study demonstrates the feasibility of implementing a federated network for multi-institutional collaboration among clinicians and using FL to conduct machine learning research, thereby opening a new paradigm for neurosurgical collaboration.

77. MFF-Net: A Lightweight Multi-Frequency Network for Measuring Heart Rhythm from Facial Videos. PubMed

Wenqin Yan, Jialiang Zhuang, Yuheng Chen, et al.
Sensors (Basel). 2024 Dec 12;24(24):7937. doi: 10.3390/s24247937.
Remote photo-plethysmography (rPPG) is a useful camera-based health monitoring method that can measure the heart rhythm from facial videos. Many well-established deep learning models can provide highly accurate and robust results in measuring heart rate (HR) and heart rate variability (HRV). However, these methods are unable to effectively eliminate illumination variation and motion artifact disturbances, and their substantial computational resource requirements significantly limit their applicability in real-world scenarios. Hence, we propose a lightweight multi-frequency network named MFF-Net to measure heart rhythm via facial videos in a short time. Firstly, we propose a multi-frequency mode signal fusion (MFF) mechanism, which can separate the characteristics of different modes of the original rPPG signals and send them to a processor with independent parameters, helping the network recover blood volume pulse (BVP) signals accurately under a complex noise environment. In addition, in order to help the network extract the characteristics of different modal signals effectively, we designed a temporal multiscale convolution module (TMSC-module) and spectrum self-attention module (SSA-module). The TMSC-module can expand the receptive field of the signal-refining network, obtain more abundant multiscale information, and transmit it to the signal reconstruction network. The SSA-module can help a signal reconstruction network locate the obvious inferior parts in the reconstruction process so as to make better decisions when merging multi-dimensional signals. Finally, in order to solve the over-fitting phenomenon that easily occurs in the network, we propose an over-fitting sampling training scheme to further improve the fitting ability of the network. Comprehensive experiments were conducted on three benchmark datasets, and we estimated HR and HRV based on the BVP signals derived by MFF-Net. 
Compared with state-of-the-art methods, our approach achieves better performance both on HR and HRV estimation with lower computational burden. We can conclude that the proposed MFF-Net has the opportunity to be applied in many real-world scenarios.

78. Learning multi-modal representations by watching hundreds of surgical video lectures. PubMed

Kun Yuan, Vinkle Srivastav, Tong Yu, et al.
Med Image Anal. 2025 Oct;105:103644. doi: 10.1016/j.media.2025.103644. Epub 2025 Jun 4.
Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective to align video clip embeddings with the corresponding multiple text embeddings by bringing them together within a joint latent space. To effectively demonstrate the representational capability of the learned joint latent space, we introduce several vision-and-language surgical tasks and evaluate various vision-only tasks specific to surgery, e.g., surgical tool, phase, and triplet recognition. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP's potential as a general-purpose foundation model for surgical workflow analysis, reducing the reliance on extensive manual annotations for downstream tasks, and facilitating adaptation methods such as few-shot learning to build a scalable and data-efficient solution for various downstream surgical applications. 
The code is available at https://github.com/CAMMA-public/SurgVLP.

79. Navigating the AI Landscape in Medical Imaging: A Critical Analysis of Technologies, Implementation, and Implications. PubMed

Jacob Sosna, Leo Joskowicz, Mor Saban
Radiology. 2025 Jun;315(3):e240982. doi: 10.1148/radiol.240982.
The growing volume and complexity of medical imaging outpaces the available radiologist workforce, risking timely diagnosis. Comprehensive artificial intelligence (AI) that integrates multimodal imaging data, clinical notes, and large language models has the potential to support radiologists. Accordingly, the U.S. Food and Drug Administration has cleared more than 770 AI medical devices that focus on radiology, primarily based on deep learning. However, algorithm development and validation remain challenging. Limitations include sparse expert-annotated data and regulatory hurdles. Clinical implementation and the adaptation of the radiologic community is also lagging behind. Additionally, technical barriers exist regarding data availability, large language model explainability, deep learning model generalization, and clinical integration. Advances in few-shot learning, self-supervised models, and centralized platforms may support consolidated AI ecosystems. Although progress has been made, much work is still needed on data infrastructure, responsible clinical translation, and workflow integration. Continuous multidisciplinary efforts are required to optimize AI safety and truly augment radiologists' work through comprehensive solutions. By overcoming the remaining challenges, AI may strengthen health care systems through improved diagnosis. This review addresses integration challenges, pathways for responsible progress, and the viewpoints of all stakeholders.

80. A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning. PubMed

Zishan Gu, Fenglin Liu, Jiayuan Chen, et al.
Adv Intell Syst. 2025 Aug;7(8):2400840. doi: 10.1002/aisy.202400840. Epub 2025 Feb 5.
The adoption of large language models (LLMs) in healthcare has garnered significant research interest, yet their performance remains limited due to a lack of domain-specific knowledge, medical reasoning skills, and their unimodal nature, which restricts them to text-only inputs. To address these limitations, we propose MultiMedRes, a multimodal medical collaborative reasoning framework that simulates human physicians' communication by incorporating a learner agent to proactively acquire information from domain-specific expert models. MultiMedRes addresses medical multimodal reasoning problems through three steps i) Inquire: The learner agent decomposes complex medical reasoning problems into multiple domain-specific sub-problems; ii) Interact: The agent engages in iterative "ask-answer" interactions with expert models to obtain domain-specific knowledge; and iii) Integrate: The agent integrates all the acquired domain-specific knowledge to address the medical reasoning problems (e.g., identifying the difference of disease levels and abnormality sizes between medical images). We validate the effectiveness of our method on the task of difference visual question answering for X-ray images. The experiments show that our zero-shot prediction achieves state-of-the-art performance, surpassing fully supervised methods, which demonstrates that MultiMedRes could offer trustworthy and interpretable assistance to physicians in monitoring the treatment progression of patients, paving the way for effective human-AI interaction and collaboration.

81. Artificial intelligence-based prediction model for surgical site infection in metastatic spinal disease: a multicenter development and validation study. PubMed

Yunpeng Cui, Xuedong Shi, Qiwei Wang, et al.
Int J Surg. 2025 Oct 1;111(10):6867-6884. doi: 10.1097/JS9.0000000000002806. Epub 2025 Jun 27.
BACKGROUND: Treatment of metastatic spinal disease often involves surgical intervention; however, surgical site infections (SSIs) pose a great challenge for spine surgeons. At present, there is an absence of reliable clinical tools for predicting SSI, which can adversely affect treatment decisions and overall patient management. This study aims to construct and validate an application to stratify the patients at high risk of SSI among those with metastatic spinal disease using an artificial intelligence (AI) approach. METHODS: A total of 667 patients diagnosed with metastatic spinal disease were enrolled in this study to train and validate models. Patients in the model-derivation cohort (n = 485) from two tertiary medical institutions were randomly divided into two groups at a ratio of 8:2, with the most patients belonging to the model-training group and the remaining patients classified into the model-validation group. External validation was conducted among patients (n = 182) from another tertiary medical institution. Logistic Regression and five machine learning algorithms, including support vector machine, gradient boosting machine (GBM), K-nearest neighbor (KNN), neural network (NN), and decision tree, were used to train and optimize models. The predictive performance of the models was assessed through both discrimination and calibration. The model demonstrating the best prediction accuracy was selected as the AI platform for assessing the risk of SSI in patients with metastatic spinal disease. To evaluate the clinical utility of our AI model, we conducted a comparative study involving 100 patients undergoing surgery for metastatic spinal disease at one tertiary medical institution. RESULTS: The incidence of SSI in spinal metastasis surgeries was 6.4% in the model derivation cohort and 7.7% in the external validation cohort. 
Among all models, the GBM model had the highest area under the curve (AUC) value (0.986, 95% confidence interval [CI]: 0.972-1.000), followed by the KNN model (0.962, 95% CI: 0.933-0.991), and the NN model (0.944, 95% CI: 0.914-0.974). The GBM model also had the best prediction performance in terms of accuracy, precision, recall, F1 score, Brier score, and log loss. The calibration curve revealed the GBM model had favorable calibration ability, and decision curve analysis showed the GBM model had significant net clinical benefits in various risk thresholds. External validation generated an AUC value of the model of 0.848 (95% CI: 0.806-0.890). Surgery time, tumor type, and number of comorbidities were identified as the three most influential factors for postoperative SSI. The AI application achieved significantly higher accuracy than clinician assessments (AUC: 0.986 vs. 0.572-0.627, P < 0.001). Sensitivity analysis confirmed robustness across subgroups (e.g. diabetes, visceral metastases). CONCLUSIONS: This study develops and validates an AI tool with strong predictive performance in identifying patients at a high risk for SSI. By facilitating personalized treatment based on risk classification, this advancement has the potential to significantly enhance surgical care for patients with metastatic spinal disease. Future research should focus on integrating this predictive tool into clinical practice and exploring its applicability across diverse patient populations.

82. Telementoring and telerobotics in urological surgery. PubMed

Ben Challacombe, Sarah Wheatstone
Curr Urol Rep. 2010 Feb;11(1):22-8. doi: 10.1007/s11934-009-0086-8.
For more than 150 years, doctors have had the ability to transmit medical information to advise and assist their colleagues in remote locations via teleconsultation using a variety of communication modalities. In surgery this has evolved into the telementoring of minimally invasive procedures, particularly, robotic surgery, which have become relatively commonplace in urology. The ultimate progression to true telerobotic surgery, in which remote surgeons independently perform complex and fundamental parts of procedures at long range, is starting to occur. This article discusses the current state of telementoring and telerobotics in urology and examines the pros and cons of this technology at the present time.

83. Telesurgery. Remote monitoring and assistance during laparoscopy. PubMed

R E Link, P G Schulam, L R Kavoussi
Urol Clin North Am. 2001 Feb;28(1):177-88. doi: 10.1016/s0094-0143(01)80020-3.
In comparison to open surgery, laparoscopy results in less postoperative pain, shorter hospitalization, more rapid return to the work force, a better cosmetic result, and a lower incidence of postoperative intra-abdominal adhesions. These advantages are indisputable when comparing large series for cholecystectomy and smaller series for pelvic lymph node dissection, nephrectomy, and bladder neck suspension in experienced hands. Urologists have an obligation to explore the application of these methods to urologic disease and to adjust the standard of care accordingly. Several barriers to the expansion of urologic laparoscopic surgery exist. The experience in extirpative and reconstructive urologic procedures is limited when compared with the data on cholecystectomy. These procedures are technically complex and demand advanced laparoscopic skills and familiarity with laparoscopic anatomy. The steep learning curve translates into long operative times and an unacceptably high rate of complications for inexperienced laparoscopic surgeons. Most practicing urologists have no formal training in advanced laparoscopy, and no formal credentialing guidelines exist. Telesurgical technology may provide one solution to this problem. Through telesurgical mentoring, less experienced surgeons with basic laparoscopic skills could receive training in advanced techniques from a world expert without the need for travel. These systems could also be used to proctor laparoscopic cases for credentialing purposes and to provide a more uniform standard of care. This review has outlined some of the exciting progress made in the field of telesurgery over the past 10 years and described some of the technical and legal obstacles that remain to be surmounted. During the 1990s, urologists were at the forefront of innovation in remote telepresence surgery. 
As the scope of minimally invasive urologic surgery expands during the first few decades of the twenty-first century, telesurgical mentoring should have an increasingly important role.

84. The what? How? And Who? Of video based assessment. PubMed

Carla M Pugh, Daniel A Hashimoto, James R Korndorffer
Am J Surg. 2021 Jan;221(1):13-18. doi: 10.1016/j.amjsurg.2020.06.027. Epub 2020 Jun 23.
BACKGROUND: Currently, there is significant variability in the development, implementation and overarching goals of video review for assessment of surgical performance. METHODS: This paper evaluates the current methods in which video review is used for evaluation of surgical performance and identifies which processes are critical for successful, widespread implementation of video-based assessment. RESULTS: Despite the advances in video capture technology and growing interest in video-based assessment, there is a notable gap in the implementation and longitudinal use of formative and summative assessment using video. CONCLUSION: Validity, scalability and discoverability are current but removable barriers to video-based assessment.

85. Design and application of compliant mechanisms for surgical tools. PubMed

S Kota, K-J Lu, K Kreiner, et al.
J Biomech Eng. 2005 Nov;127(6):981-9. doi: 10.1115/1.2056561.
This paper introduces the benefits of exploiting elasticity in the engineering design of surgical tools, in general, and of minimally invasive procedures, in particular. Compliant mechanisms are jointless mechanisms that rely on elastic deformation to transmit forces and motion. The lack of traditional joints in these single-piece flexible structures offers many benefits, including the absence of wear debris, pinch points, crevices, and lubrication. Such systems are particularly amenable to embedded sensing for haptic feedback and embedded actuation with active-material actuators. The paper provides an overview of design synthesis methods developed at the Compliant Systems Design Laboratory and focuses specifically on surgical applications. Compliant systems have potential to integrate well within the constraints of laparoscopic procedures and telerobotic surgery. A load-path representation is used within a genetic algorithm to solve two gripper example problems. In addition, the paper illustrates the design and construction of an organ (kidney) manipulator for use in minimally invasive procedures.