Zhang Jinnian, Chen Weijie, Joshi Tanmayee, Zhang Xiaomin, Loh Po-Ling, Jog Varun, Bruce Richard J, Garrett John W, McMillan Alan B
Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI 53706, USA.
Department of Computer Science, University of Wisconsin-Madison, Madison, WI 53706, USA.
Tomography. 2024 Dec 13;10(12):2058-2072. doi: 10.3390/tomography10120146.
This research introduces BAE-ViT, a vision transformer model developed for bone age estimation (BAE). The model is designed to merge image and sex data efficiently, a capability not readily available in traditional convolutional neural networks (CNNs). BAE-ViT employs a novel data fusion method that enables fine-grained interactions between visual and non-visual data: non-visual information is tokenized, and all tokens (visual or non-visual) are concatenated to form the model input. The model was trained on the large-scale dataset from the 2017 RSNA Pediatric Bone Age Machine Learning Challenge, where it performed strongly and was notably more robust to image distortions than existing models. Statistical analysis further confirmed the effectiveness of BAE-ViT, showing a strong correlation between its predictions and the ground-truth labels. This study shows that vision transformers are a viable option for integrating multimodal data in medical imaging, in particular for incorporating non-visual elements such as sex information into the model. Beyond its strong performance on this specific task, the tokenization method offers a versatile framework for multimodal fusion in medical imaging applications.
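To make the fusion idea concrete, the sketch below illustrates the general pattern the abstract describes: the categorical sex input is mapped through an embedding table to a token with the same dimensionality as the image patch tokens, and the combined token sequence is processed by a standard transformer encoder so self-attention can model visual/non-visual interactions. This is not the authors' released implementation; the class name, hyperparameters, and regression head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultimodalViTSketch(nn.Module):
    """Toy ViT-style regressor fusing image patches with a sex token
    (a minimal sketch of the tokenize-and-concatenate idea, not BAE-ViT itself)."""

    def __init__(self, img_size=224, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: one linear projection per non-overlapping patch.
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch_size, stride=patch_size)
        # Learned [CLS] token and positional embeddings for the visual tokens.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # The non-visual input (sex) gets its own embedding table, so it
        # enters self-attention on equal footing with the image tokens.
        self.sex_embed = nn.Embedding(2, dim)  # assumed coding: 0 = female, 1 = male
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)  # regress bone age (e.g., in months)

    def forward(self, x, sex):
        b = x.size(0)
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, patches], dim=1) + self.pos_embed
        # Concatenate the sex token with the visual tokens; the attention
        # layers then mix visual and non-visual information jointly.
        sex_tok = self.sex_embed(sex).unsqueeze(1)  # (B, 1, dim)
        tokens = torch.cat([tokens, sex_tok], dim=1)
        out = self.encoder(tokens)
        return self.head(out[:, 0])  # predict from the [CLS] token

# Usage on a dummy batch of grayscale hand radiographs:
model = MultimodalViTSketch()
imgs = torch.randn(2, 1, 224, 224)
sex = torch.tensor([0, 1])
print(model(imgs, sex).shape)  # torch.Size([2, 1])
```

The design choice worth noting is that the sex variable is injected as a full token rather than concatenated to a pooled feature vector (the common CNN approach), which lets every attention layer condition the visual representation on it.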