Hu Xiaoling, Liu Chang
Institute of Applied Mathematics, Beijing Information Science & Technology University, Beijing 100101, China.
Animals (Basel). 2024 Jun 7;14(12):1712. doi: 10.3390/ani14121712.
Traditional image-based animal pose estimation faces significant hurdles, including scarce training data, costly data annotation, and non-rigid deformation. To address these issues, we propose dynamic conditional prompts that encode prior knowledge of animal poses in the language modality, and we estimate animal poses with a multimodal (language-image) model that combines collaborative training and contrastive learning. Our method uses text prompt templates and image feature conditional tokens to construct dynamic conditional prompts that deeply integrate rich linguistic prior knowledge. The text prompts highlight key points and relevant descriptions of animal poses, enhancing their representation during learning. Meanwhile, image feature conditional tokens, produced by a fully connected non-linear network, embed the image features into these prompts. The resulting context vector, derived by fusing the text prompt template with the image feature conditional token, yields a dynamic conditional prompt for each input sample. By building on a contrastive language-image pre-training model, our approach synchronizes and strengthens the training interactions between image and text features, improving the precision of key-point localization and the overall accuracy of animal pose estimation. Experimental results show that language-image contrastive learning based on dynamic conditional prompts improves the average accuracy of animal pose estimation on the AP-10K and Animal Pose datasets.
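The prompt construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the template and the conditional token are fused, so additive fusion is assumed here, the text encoder is replaced by simple mean pooling, and all names (`conditional_token`, `dynamic_prompt`, `keypoint_scores`), dimensions, and the temperature value are hypothetical.

```python
import math
import random


def linear(x, w, b):
    """Affine map; w holds one row of weights per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def init_matrix(rows, cols, rng, scale=0.1):
    """Random matrix standing in for learned parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def conditional_token(image_feature, params):
    """Fully connected non-linear network: image feature -> conditional token."""
    hidden = [max(0.0, v) for v in linear(image_feature, params["w1"], params["b1"])]
    return linear(hidden, params["w2"], params["b2"])


def dynamic_prompt(context_vectors, token, keypoint_embedding):
    """Assumed additive fusion: shift every template context vector by the
    image-conditioned token, then append the key-point description embedding."""
    shifted = [[c + t for c, t in zip(vec, token)] for vec in context_vectors]
    return shifted + [keypoint_embedding]


def encode_prompt(prompt):
    """Stand-in for a text encoder (assumption): mean-pool the prompt tokens."""
    return [sum(col) / len(prompt) for col in zip(*prompt)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def keypoint_scores(image_feature, context_vectors, keypoint_embeddings,
                    params, temperature=0.07):
    """Contrastive-style scores: softmax over image/text cosine similarities."""
    token = conditional_token(image_feature, params)
    logits = []
    for embedding in keypoint_embeddings:
        prompt = dynamic_prompt(context_vectors, token, embedding)
        logits.append(cosine(image_feature, encode_prompt(prompt)) / temperature)
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage with random values in place of real encoder outputs.
rng = random.Random(0)
dim, hidden_dim, n_ctx, n_keypoints = 8, 4, 3, 3
params = {
    "w1": init_matrix(hidden_dim, dim, rng), "b1": [0.0] * hidden_dim,
    "w2": init_matrix(dim, hidden_dim, rng), "b2": [0.0] * dim,
}
context_vectors = init_matrix(n_ctx, dim, rng)             # prompt template
keypoint_embeddings = init_matrix(n_keypoints, dim, rng)   # key-point texts
image_feature = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # one input sample
probs = keypoint_scores(image_feature, context_vectors, keypoint_embeddings, params)
```

Because the token is computed from each image feature, every input sample receives its own prompt, which is the sense in which the prompt is dynamic. In a real system the random matrices would be trained parameters and the pooling step would be a pre-trained text encoder.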