Rajendran Praveenbalaji, Chen Yizheng, Qiu Liang, Niedermayr Thomas, Liu Wu, Buyyounouski Mark, Bagshaw Hilary, Han Bin, Yang Yong, Kovalchuk Nataliya, Gu Xuejun, Hancock Steven, Xing Lei, Dai Xianjin
Department of Radiation Oncology, Stanford University, Stanford, California.
Int J Radiat Oncol Biol Phys. 2025 Jan 1;121(1):230-240. doi: 10.1016/j.ijrobp.2024.07.2149. Epub 2024 Aug 6.
Artificial intelligence-aided methods have made significant progress in the auto-delineation of normal tissues. However, these approaches struggle with the auto-contouring of radiation therapy target volumes. Our goal was to model target volume delineation as a clinical decision-making problem and to resolve it with large language model-aided multimodal learning.
A vision-language model, termed Medformer, was developed, employing a hierarchical vision transformer as its backbone and incorporating large language models to extract text-rich features. The contextually embedded linguistic features are integrated into the visual features through a visual-language attention module for language-aware visual encoding. Metrics, including the Dice similarity coefficient (DSC), intersection over union (IOU), and 95th percentile Hausdorff distance (HD95), were used to quantitatively evaluate the model's performance. The evaluation was conducted on an in-house prostate cancer dataset and a public oropharyngeal carcinoma dataset, totaling 668 subjects.
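The abstract does not disclose the implementation of the visual-language attention module. As a rough sketch only, language-aware visual encoding of this kind is commonly realized as cross-attention in which visual tokens query text embeddings from a language model; the PyTorch module below illustrates that general pattern under assumed token shapes and dimensions, and is not the authors' Medformer code.

import torch
import torch.nn as nn

class VisualLanguageAttention(nn.Module):
    # Illustrative cross-attention: visual tokens (queries) attend to
    # contextual text embeddings (keys/values) from a language model.
    def __init__(self, vis_dim: int, txt_dim: int, n_heads: int = 8):
        super().__init__()
        self.proj_txt = nn.Linear(txt_dim, vis_dim)  # align text width to visual width
        self.attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N_v, vis_dim) from a hierarchical vision transformer stage
        # txt_tokens: (B, N_t, txt_dim) from the large language model
        txt = self.proj_txt(txt_tokens)
        fused, _ = self.attn(query=vis_tokens, key=txt, value=txt)
        return self.norm(vis_tokens + fused)  # residual, language-aware visual features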
Our Medformer achieved a DSC of 0.81 ± 0.10 versus 0.72 ± 0.10, IOU of 0.73 ± 0.12 versus 0.65 ± 0.09, and HD95 of 9.86 ± 9.77 mm versus 19.13 ± 12.96 mm for delineation of gross tumor volume on the prostate cancer dataset. Similarly, on the oropharyngeal carcinoma dataset, it achieved a DSC of 0.77 ± 0.11 versus 0.72 ± 0.09, IOU of 0.70 ± 0.09 versus 0.65 ± 0.07, and HD95 of 7.52 ± 4.8 mm versus 13.63 ± 7.13 mm, representing significant improvements (P < 0.05). For delineating the clinical target volume, Medformer achieved a DSC of 0.91 ± 0.04, IOU of 0.85 ± 0.05, and HD95 of 2.98 ± 1.60 mm, comparable with other state-of-the-art algorithms.
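To aid interpretation of the reported numbers, the three metrics have standard definitions: DSC(A, B) = 2|A ∩ B| / (|A| + |B|), IOU(A, B) = |A ∩ B| / |A ∪ B|, and HD95 is the 95th percentile of the distances between the two contour surfaces. The functions below compute them from binary masks under those standard definitions (one common HD95 variant, used here, pools both surface-to-surface directions before taking the percentile); they are illustrative and are not the authors' evaluation code.

import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dsc(pred: np.ndarray, gt: np.ndarray) -> float:
    # Dice similarity coefficient of two boolean masks.
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    # Intersection over union of two boolean masks.
    inter = np.logical_and(pred, gt).sum()
    return inter / np.logical_or(pred, gt).sum()

def hd95(pred: np.ndarray, gt: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    # 95th percentile Hausdorff distance between mask surfaces,
    # in mm when voxel spacing is given in mm.
    sp = pred & ~binary_erosion(pred)   # surface voxels of the prediction
    sg = gt & ~binary_erosion(gt)       # surface voxels of the ground truth
    d_to_gt = distance_transform_edt(~sg, sampling=spacing)    # distance to gt surface
    d_to_pred = distance_transform_edt(~sp, sampling=spacing)  # distance to pred surface
    dists = np.concatenate([d_to_gt[sp], d_to_pred[sg]])
    return float(np.percentile(dists, 95))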
Auto-delineation of the treatment target based on multimodal learning outperforms conventional approaches that rely purely on visual features. Our method could be adopted into routine practice to rapidly contour the clinical target volume and gross tumor volume.