SingHealth, Duke-NUS Medical School.
Curr Opin Ophthalmol. 2024 Nov 1;35(6):487-493. doi: 10.1097/ICU.0000000000001089. Epub 2024 Aug 26.
Purpose of review: Vision Language Models are an emerging paradigm in artificial intelligence that offers the potential to natively analyze both image and textual data simultaneously, within a single model. The fusion of these two modalities is of particular relevance to ophthalmology, which has historically relied on specialized imaging techniques such as angiography, optical coherence tomography, and fundus photography, while also interfacing with electronic health records that include free-text descriptions. This review surveys the fast-evolving field of Vision Language Models as applied to current ophthalmologic research and practice.
Recent findings: Although models incorporating both image and text data have a long history in ophthalmology, effective multimodal Vision Language Models are a recent development that exploits advances in technologies such as transformer and autoencoder models.
Summary: Vision Language Models offer the potential to assist and streamline the existing clinical workflow in ophthalmology, whether before, during, or after a visit. Important challenges remain, however, particularly regarding patient privacy and the explainability of model recommendations.