Gridach Mourad, Alsharid Mohammad, Jiao Jianbo, Drukker Lior, Papageorghiou Aris T, Noble J Alison
University of Oxford.
Khalifa University.
Proc IEEE Int Symp Biomed Imaging. 2024 May 27;2024:1-4. doi: 10.1109/ISBI56570.2024.10635693.
This paper tackles the challenging problem of self-supervised representation learning from real-world data in two modalities: fetal ultrasound (US) video and the corresponding speech acquired while a sonographer performs a pregnancy scan. We propose to transfer knowledge between the modalities, even though the sonographer's speech and the US video may not be semantically correlated. We design a network architecture capable of learning useful representations, such as of anatomical features and structures, while recognising the correlation between a US video scan and the sonographer's speech. We introduce dual representation learning from US video and audio, which combines two concepts in a latent feature space: Multi-Modal Contrastive Learning and Multi-Modal Similarity Learning. Experiments show that the proposed architecture learns powerful representations and transfers well to two downstream tasks. Furthermore, we pretrain on two different datasets that differ in size and in the length of the video clips (and of the accompanying sonographer speech), showing that the quality of the sonographer's speech plays an important role in the final performance.
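To make the multi-modal contrastive idea concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over paired video/audio embeddings in a shared latent space. This is an illustrative assumption, not the paper's implementation: the function name, the temperature value, and the NumPy formulation are all hypothetical, and the paper's actual architecture and loss may differ.

```python
import numpy as np

def contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    video_emb, audio_emb: (N, D) arrays of latent features, where row i of
    each array comes from the same scan clip. Matched video/audio pairs are
    pulled together; mismatched pairs within the batch are pushed apart.
    """
    # L2-normalise so the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; entry (i, j) compares video i with audio j.
    logits = (v @ a.T) / temperature
    labels = np.arange(len(v))  # the matching pair sits on the diagonal

    def cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video->audio and audio->video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Under this sketch, a batch whose video and audio embeddings are well aligned yields a lower loss than a batch of unrelated embeddings, which is the signal that lets knowledge transfer between the two modalities even without strict semantic correlation.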