Department of Engineering Science, University of Oxford, Oxford, UK.
School of Computer Science, University of Birmingham, Birmingham, UK.
Sci Rep. 2024 Jul 6;14(1):15569. doi: 10.1038/s41598-024-66160-4.
Auditory and visual signals are two primary perceptual modalities that usually occur together and correlate with each other, not only in natural environments but also in clinical settings. However, audio-visual modelling in the clinical case can be more challenging, owing to the different sources of the audio/video signals and the noise (both signal-level and semantic-level) in the auditory signals, which are usually speech audio. In this study, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without relying on dense supervisory annotations from human experts for model training. A simple yet effective multi-modal self-supervised learning framework is presented for this purpose. The proposed approach can help find standard anatomical planes, predict the focus position of the sonographer's eyes, and localise anatomical regions of interest during ultrasound imaging. Experimental analysis on a large-scale clinical multi-modal ultrasound video dataset shows that the proposed representation learning method provides transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully supervised solutions. Learning such medical representations in a self-supervised manner will contribute to a better understanding of obstetric imaging, the training of new sonographers, more effective assistive tools for human experts, and enhancement of the clinical workflow.
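To make the idea of multi-modal self-supervision concrete, the sketch below shows a symmetric InfoNCE-style contrastive objective, a common choice for aligning audio and video embeddings without labels: clips from the same moment form positive pairs, and all other pairings in the batch serve as negatives. This is a generic illustration under our own assumptions, not the authors' exact training objective; the function name and temperature value are illustrative.

```python
import numpy as np

def info_nce_loss(audio_emb, video_emb, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired embeddings.

    audio_emb, video_emb: arrays of shape (batch, dim) where row i of
    each array comes from the same audio-video moment (a positive pair).
    """
    # L2-normalise so the dot product becomes cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(a))         # diagonal entries are the positives

    def xent(lg):
        # Cross-entropy of each row against its diagonal target.
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the audio-to-video and video-to-audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimising this loss pulls matched audio/video embeddings together and pushes mismatched ones apart, which is what yields representations transferable to downstream tasks such as standard-plane detection or gaze prediction.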