Alsharid Mohammad, Sharma Harshita, Drukker Lior, Chatelain Pierre, Papageorghiou Aris T, Noble J Alison
University of Oxford, Oxford, UK.
Med Image Comput Comput Assist Interv. 2019 Oct;22:338-346. doi: 10.1007/978-3-030-32251-9_37. Epub 2019 Oct 10.
We describe an automatic natural language processing (NLP)-based image captioning method to describe fetal ultrasound video content by modelling the vocabulary commonly used by sonographers and sonologists. The generated captions are similar to the words spoken by a sonographer when describing the scan experience in terms of visual content and performed scanning actions. Using full-length second-trimester fetal ultrasound videos and text derived from accompanying expert voice-over audio recordings, we train deep learning models consisting of convolutional neural networks and recurrent neural networks in merged configurations to generate captions for ultrasound video frames. We evaluate different model architectures using established general metrics and application-specific metrics. Results show that the proposed models can learn joint representations of image and text to generate relevant and descriptive captions for anatomies, such as the spine, the abdomen, the heart, and the head, in clinical fetal ultrasound scans.
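To make the "merged configuration" concrete, below is a minimal PyTorch sketch of a merge-style CNN-RNN captioner: image features and partial-caption features are encoded by separate branches and combined only before the word classifier. The class name, layer sizes, feature dimension, and vocabulary size are illustrative assumptions, not the paper's reported architecture, and the CNN backbone is assumed to have been applied offline to produce frame features.

```python
import torch
import torch.nn as nn

class MergeCaptioner(nn.Module):
    """Merge-style captioner (hypothetical sketch): the image branch and the
    text branch are encoded independently, then concatenated and passed to a
    classifier that predicts the next caption word."""

    def __init__(self, vocab_size, embed_dim=256, feat_dim=2048, hidden_dim=256):
        super().__init__()
        # Image branch: project pre-extracted CNN frame features
        # (e.g. from a ResNet-style backbone) into the joint space.
        self.img_proj = nn.Linear(feat_dim, hidden_dim)
        # Text branch: embed the caption-so-far and summarise it with an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Merge the two modalities and predict next-word logits.
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, img_feats, caption_prefix):
        img = torch.relu(self.img_proj(img_feats))   # (B, H)
        emb = self.embed(caption_prefix)             # (B, T, E)
        _, (h, _) = self.lstm(emb)                   # h: (1, B, H)
        txt = h.squeeze(0)                           # (B, H)
        merged = torch.cat([img, txt], dim=1)        # (B, 2H)
        return self.fc(merged)                       # (B, vocab_size)

# Toy usage with random data: 4 frames, caption prefixes of length 7.
model = MergeCaptioner(vocab_size=5000)
feats = torch.randn(4, 2048)
prefix = torch.randint(0, 5000, (4, 7))
logits = model(feats, prefix)
print(logits.shape)  # torch.Size([4, 5000])
```

At inference time such a model is typically unrolled greedily: starting from a start token, the highest-scoring word is appended to the prefix and the model is re-applied until an end token or a length limit is reached.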