Wu Xinyuan, Wang Lili, Chen Ruoyu, Liu Bowen, Zhang Weiyi, Yang Xi, Feng Yifan, He Mingguang, Shi Danli
School of Optometry, Hong Kong Polytechnic University, Hong Kong SAR, China.
Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR, China.
JAMA Ophthalmol. 2025 Jun 26. doi: 10.1001/jamaophthalmol.2025.1419.
IMPORTANCE: Medical data sharing faces strict restrictions. Text-to-video generation shows potential for creating realistic medical data while preserving privacy, offering a solution for cross-center data sharing and medical education.
OBJECTIVE: To develop and evaluate a text-to-video generative artificial intelligence (AI)-driven model that converts report text into dynamic fundus fluorescein angiography (FFA) videos, enabling visualization of retinal vascular and structural abnormalities.
DESIGN, SETTING, AND PARTICIPANTS: This study retrospectively collected anonymized FFA data from a tertiary hospital in China. The dataset included both the medical records and the FFA examinations of patients assessed between November 2016 and December 2019. A text-to-video model integrating a wavelet-flow variational autoencoder with a diffusion transformer (sketched below) was developed and evaluated.
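For orientation only, the following is a minimal PyTorch sketch of the kind of pipeline described: a report-text embedding conditions a transformer that denoises video latents, and a small decoder stands in for the wavelet-flow variational autoencoder. All module sizes, the single denoising pass, and the omission of timestep conditioning are illustrative assumptions, not the study's implementation.

```python
import torch
import torch.nn as nn

class TextToFFAVideo(nn.Module):
    """Illustrative stand-in for a text-conditioned diffusion transformer
    over video latents. Real diffusion models add timestep embeddings and
    run an iterative sampling loop; both are omitted here for brevity."""

    def __init__(self, text_dim=512, latent_dim=64, n_heads=8, n_layers=4):
        super().__init__()
        # Stand-in projection for the report-text encoder output.
        self.text_proj = nn.Linear(text_dim, latent_dim)
        # Transformer backbone acting on flattened video latents.
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=n_heads, batch_first=True)
        self.dit = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Stand-in for the VAE decoder (latent token -> small RGB frame).
        self.decode = nn.Linear(latent_dim, 3 * 16 * 16)

    def forward(self, noisy_latents, text_emb):
        # Prepend the text condition as a token, denoise, drop the token.
        cond = self.text_proj(text_emb).unsqueeze(1)
        x = self.dit(torch.cat([cond, noisy_latents], dim=1))[:, 1:]
        b, t, _ = x.shape
        return self.decode(x).view(b, t, 3, 16, 16)

model = TextToFFAVideo()
frames = model(torch.randn(2, 8, 64), torch.randn(2, 512))
print(frames.shape)  # torch.Size([2, 8, 3, 16, 16])
```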
MAIN OUTCOMES AND MEASURES: The AI-driven model's performance was assessed with objective metrics (Fréchet video distance, learned perceptual image patch similarity score, and visual question answering score [VQAScore]). Domain-specific alignment between the generated FFA videos and the report text was measured with the bidirectional encoder representations from transformers score (BERTScore). Image retrieval was evaluated with Recall@K scores (computed as sketched below). Each video was rated for quality by 3 ophthalmologists on a scale of 1 (excellent) to 5 (very poor).
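For reference, Recall@K in a retrieval setting is typically the fraction of queries whose true match appears among the K most similar candidates. The sketch below shows one common way to compute it from a similarity matrix; the random similarity matrix and the diagonal-match convention are assumptions for illustration, not the study's protocol.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """sim[i, j] is the similarity between generated video i and real
    video j; the true counterpart of video i is assumed to sit at j = i.
    Returns the fraction of rows whose match ranks in the top k."""
    topk = np.argsort(-sim, axis=1)[:, :k]   # indices of the k best matches
    hits = np.any(topk == np.arange(len(sim))[:, None], axis=1)
    return float(hits.mean())

rng = np.random.default_rng(0)
sim = rng.standard_normal((387, 387))  # placeholder: 387-video test set
for k in (5, 10, 50):
    print(f"Recall@{k}: {recall_at_k(sim, k):.3f}")
```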
RESULTS: A total of 3625 FFA videos were included (2851 videos [78.6%] for training, 387 videos [10.7%] for validation, and 387 videos [10.7%] for testing). The AI-generated FFA videos demonstrated the retinal abnormalities described in the input text (Fréchet video distance, 2273; mean learned perceptual image patch similarity score, 0.48 [SD, 0.04]; mean VQAScore, 0.61 [SD, 0.08]). The domain-specific evaluations showed alignment between the generated videos and the textual prompts (mean BERTScore, 0.35 [SD, 0.09]). The Recall@K scores were 0.02 for K = 5, 0.04 for K = 10, and 0.16 for K = 50, yielding a mean score of 0.073, reflecting disparities between AI-generated and real clinical videos and demonstrating privacy-preserving effectiveness. In the 3 ophthalmologists' assessment of visual quality, the mean score was 1.57 (SD, 0.44).
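The reported mean is the simple average of the three Recall@K values, and the scores can be compared against the chance level for a 387-video gallery (K/387); the baseline arithmetic below is our addition, not the paper's.

```python
reported = {5: 0.02, 10: 0.04, 50: 0.16}     # Recall@K values from the paper
print(round(sum(reported.values()) / 3, 3))  # 0.073, the reported mean
for k, r in reported.items():
    # Chance-level retrieval against the 387-video test set is K/387.
    print(f"K={k}: reported {r}, chance {k / 387:.3f}")
```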
CONCLUSIONS AND RELEVANCE: This study demonstrated that an AI-driven text-to-video model generated FFA videos from textual descriptions, potentially improving visualization for clinical and educational purposes. The privacy-preserving nature of the model may address key challenges in data sharing while trying to ensure compliance with confidentiality standards.