Department of Engineering Science, University of Oxford, Oxford, UK.
School of Computer Science, University of Birmingham, Birmingham, UK.
Sci Rep. 2024 Jul 6;14(1):15569. doi: 10.1038/s41598-024-66160-4.
Auditory and visual signals are two primary perceptual modalities that usually occur together and correlate with each other, not only in natural environments but also in clinical settings. However, audio-visual modelling in the clinical case can be more challenging, owing to the different sources of the audio/video signals and the noise (both signal-level and semantic-level) in the auditory signals, which are usually speech audio. In this study, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without relying on dense supervisory annotations from human experts for model training. A simple yet effective multi-modal self-supervised learning framework is presented for this purpose. The proposed approach can help find standard anatomical planes, predict the focus position of the sonographer's eyes, and localise anatomical regions of interest during ultrasound imaging. Experimental analysis on a large-scale clinical multi-modal ultrasound video dataset shows that the proposed representation learning method provides transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully supervised solutions. Learning such medical representations in a self-supervised manner will contribute to a better understanding of obstetric imaging, the training of new sonographers, more effective assistive tools for human experts, and enhancement of the clinical workflow.
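To make the idea of multi-modal self-supervision concrete, the sketch below shows a symmetric InfoNCE-style contrastive objective, a common choice for aligning audio and video embeddings without labels: clips from the same moment form positive pairs, and all other pairings in the batch serve as negatives. This is a generic illustration under our own assumptions, not the authors' exact training objective; the function name and temperature value are illustrative.

```python
import numpy as np

def info_nce_loss(audio_emb, video_emb, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired embeddings.

    audio_emb, video_emb: arrays of shape (batch, dim) where row i of
    each array comes from the same audio-video moment (a positive pair).
    """
    # L2-normalise so the dot product becomes cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(a))         # diagonal entries are the positives

    def xent(lg):
        # Cross-entropy of each row against its diagonal target.
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the audio-to-video and video-to-audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimising this loss pulls matched audio/video embeddings together and pushes mismatched ones apart, which is what yields representations transferable to downstream tasks such as standard-plane detection or gaze prediction.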