Li Jian, Wei Yanan, Ma Wenkai, Wang Tan
School of Biology and Food Engineering, Fuyang Normal University, Fuyang 236037, China.
Provincial Key Laboratory of Embryo Development and Reproductive Regulation, Fuyang 236037, China.
Animals (Basel). 2025 Jul 31;15(15):2245. doi: 10.3390/ani15152245.
Accurate evaluation of fish feeding intensity is crucial for optimizing aquaculture efficiency and ensuring the healthy growth of fish. Previous methods have mainly relied on single-modal approaches (e.g., audio or visual). However, the complex underwater environment poses significant challenges for single-modal monitoring: visual systems are severely affected by water turbidity, lighting conditions, and fish occlusion, while acoustic systems suffer from background noise. Although existing studies have attempted to combine acoustic and visual information, most adopt simple feature-level fusion strategies, which fail to fully exploit the complementary advantages of the two modalities under different environmental conditions and lack mechanisms for dynamically assessing modal reliability. To address these problems, we propose the Adaptive Cross-modal Attention Fusion Network (ACAF-Net), a cross-modal complementarity learning framework with a two-stage attention fusion mechanism: (1) a cross-modal enhancement stage that enriches each unimodal representation through Low-rank Bilinear Pooling and learnable fusion weights; and (2) an adaptive attention fusion stage that dynamically weights acoustic and visual features according to their complementarity and environmental reliability. Our framework incorporates dimension alignment strategies and attention mechanisms to capture the temporal-spatial complementarity between acoustic feeding signals and visual behavioral patterns. Extensive experiments demonstrate superior performance over single-modal and conventional fusion approaches, with a 6.4% improvement in accuracy. The results validate the effectiveness of exploiting cross-modal complementarity for underwater behavioral analysis and establish a foundation for intelligent aquaculture monitoring systems.
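To make stage (1) concrete, the following is a minimal PyTorch sketch of a cross-modal enhancement layer in the spirit described above. It uses the standard low-rank bilinear pooling formulation (a Hadamard product of low-rank projections with an output projection); the layer widths, the scalar gate parameterization, and the module name CrossModalEnhancement are illustrative assumptions, since the abstract does not specify ACAF-Net's exact architecture.

```python
# Sketch of the cross-modal enhancement stage, assuming the standard
# low-rank bilinear pooling form. All dimensions and the gating scheme
# are illustrative, not the paper's confirmed design.
import torch
import torch.nn as nn

class CrossModalEnhancement(nn.Module):
    def __init__(self, audio_dim: int, visual_dim: int, dim: int = 512, rank: int = 256):
        super().__init__()
        # Dimension alignment: map both modalities to a shared width `dim`.
        self.align_a = nn.Linear(audio_dim, dim)
        self.align_v = nn.Linear(visual_dim, dim)
        # Low-rank factorization of the bilinear interaction: a full
        # dim x dim x dim bilinear tensor is replaced by two rank-`rank`
        # projections plus an output projection.
        self.proj_a = nn.Linear(dim, rank)
        self.proj_v = nn.Linear(dim, rank)
        self.proj_out = nn.Linear(rank, dim)
        # Learnable fusion weights for mixing the joint term back into each
        # modality (assumed here to be scalar gates).
        self.gate_a = nn.Parameter(torch.tensor(0.5))
        self.gate_v = nn.Parameter(torch.tensor(0.5))

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        a, v = self.align_a(audio), self.align_v(visual)
        # Hadamard product of the low-rank projections approximates the
        # bilinear interaction between the two modalities.
        joint = self.proj_out(torch.tanh(self.proj_a(a)) * torch.tanh(self.proj_v(v)))
        # Each aligned unimodal representation is enriched with the joint term.
        return a + self.gate_a * joint, v + self.gate_v * joint
```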
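Stage (2) can be sketched similarly: a small gating network scores each enhanced modality and normalizes the scores into attention weights, so the fused feature leans on whichever modality looks more reliable for the current sample. The scoring network, the four-level output, and the module name AdaptiveAttentionFusion are assumptions; the abstract does not describe how ACAF-Net quantifies complementarity or environmental reliability.

```python
# Sketch of the adaptive attention fusion stage: per-modality reliability
# scores are turned into softmax weights over the two modalities.
import torch
import torch.nn as nn

class AdaptiveAttentionFusion(nn.Module):
    def __init__(self, dim: int = 512, num_classes: int = 4):
        super().__init__()
        # One reliability score per modality, computed from the features
        # themselves (a stand-in for the paper's reliability evaluation).
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1)
        )
        # e.g., four feeding-intensity levels (assumed; not stated in the abstract).
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio_enh: torch.Tensor, visual_enh: torch.Tensor):
        # Stack modalities: (batch, 2, dim); score each, then softmax over
        # the modality axis to get dynamic fusion weights.
        feats = torch.stack([audio_enh, visual_enh], dim=1)
        weights = torch.softmax(self.score(feats), dim=1)   # (batch, 2, 1)
        fused = (weights * feats).sum(dim=1)                # reliability-weighted sum
        return self.classifier(fused), weights.squeeze(-1)

# Example wiring with the CrossModalEnhancement sketch above (shapes illustrative):
# enhance = CrossModalEnhancement(audio_dim=128, visual_dim=2048, dim=512)
# fuse = AdaptiveAttentionFusion(dim=512)
# logits, modality_weights = fuse(*enhance(torch.randn(8, 128), torch.randn(8, 2048)))
```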