Xiang Yi, Acharya Rajendra, Le Quan, Tan Jen Hong, Chng Chiaw-Ling
Office of Insights & Analytics, Division of Digital Strategy, SingHealth, Singapore, Singapore.
School of Mathematics, Physics and Computing, University of Southern Queensland, Springfield Central, QLD, Australia.
Front Artif Intell. 2025 Jul 24;8:1618426. doi: 10.3389/frai.2025.1618426. eCollection 2025.
Thyroid nodule segmentation in ultrasound (US) images is a valuable yet challenging task, playing a critical role in diagnosing thyroid cancer. The difficulty arises from factors such as the absence of prior knowledge about the thyroid region, low contrast between anatomical structures, and speckle noise, all of which obscure boundary detection and introduce variability in nodule appearance across different images.
To address these challenges, we propose a transformer-based model for thyroid nodule segmentation. Unlike traditional convolutional neural networks (CNNs), transformers capture global context from the first layer, enabling a more comprehensive image representation, which is crucial for identifying subtle nodule boundaries. In this study, we first pre-train a Masked Autoencoder (MAE) to reconstruct masked patches, then fine-tune it on thyroid US data, and further explore a cross-attention mechanism to enhance information flow between the encoder and decoder.
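The patch-masking step at the heart of MAE pre-training can be sketched as follows. This is a minimal illustration, not the authors' implementation; the 75% mask ratio and 16x16 patch size are assumptions borrowed from the original MAE setup, and `mask_patches` is a hypothetical helper name.

```python
import numpy as np

def mask_patches(patches, mask_ratio=0.75, seed=0):
    """Randomly hide a fraction of image patches, as in MAE pre-training.

    Returns the visible patches (the only ones fed to the encoder) and the
    indices of the masked patches, whose pixels the decoder must reconstruct.
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_masked = int(n * mask_ratio)
    perm = rng.permutation(n)
    masked_idx = perm[:n_masked]
    visible_idx = perm[n_masked:]
    return patches[visible_idx], visible_idx, masked_idx

# A 224x224 ultrasound frame split into 16x16 patches yields 196 patches.
patches = np.zeros((196, 16 * 16))
visible, visible_idx, masked_idx = mask_patches(patches)
# With a 75% mask ratio the encoder sees only 49 of the 196 patches;
# the reconstruction loss is computed on the remaining 147.
```

Because the encoder processes only the visible quarter of the patches, pre-training is cheap relative to full-image training, which is consistent with the faster convergence reported below.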
Our experiments on the public AIMI, TN3K, and DDTI datasets show that MAE pre-training accelerates convergence. However, overall improvements are modest: the model achieves Dice Similarity Coefficient (DSC) scores of 0.63, 0.64, and 0.65 on AIMI, TN3K, and DDTI, respectively, highlighting limitations under small-sample conditions. Furthermore, adding cross-attention did not yield consistent gains, suggesting that data volume and diversity may be more critical than additional architectural complexity.
MAE pre-training notably reduces training time and helps the model learn transferable features, yet overall accuracy remains constrained by limited data and nodule variability. Future work will focus on scaling up data, pre-training the cross-attention layers, and exploring hybrid architectures to further boost segmentation performance.