Zhang Wenjie, He Changjun, Cao Yinghan, Xu Shiyun, Wang Mingjiang
Key Laboratory for Key Technologies of IoT Terminals, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China.
Sensors (Basel). 2025 Mar 13;25(6):1790. doi: 10.3390/s25061790.
Binaural audio is crucial for creating immersive auditory experiences. However, capturing binaural audio in real-world environments is costly and technically complex, which has driven growing interest in synthesizing binaural audio from monaural sources. In this paper, we propose a two-stage framework for binaural audio synthesis. In the first stage, the monaural audio is transformed into a preliminary binaural signal, from which two kinds of components are extracted: the common portion shared by the left and right channels, and the differential portion specific to each channel. In the second stage, the POS-ORI self-attention module (POSA) integrates the spatial information of the sound sources and captures their motion. Based on this representation, the common and differential components are reconstructed separately. The gated-convolutional fusion module (GCFM) then combines the reconstructed components to generate the final binaural audio. Experimental results demonstrate that the proposed method synthesizes binaural audio accurately and achieves state-of-the-art performance in phase estimation (Phase-l2: 0.789, Wave-l2: 0.147, Amplitude-l2: 0.036).
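The split into a common portion and per-channel differential portions can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so this example assumes the simplest reading: the common portion is the mean of the two channels and each differential portion is the channel minus that mean. The function names, the toy signal, and the delay and gain values are all hypothetical and are not taken from the paper.

```python
import numpy as np


def split_common_differential(binaural):
    """Split a (2, T) binaural signal into common and differential parts.

    Assumption (not from the paper): common = mean of left and right,
    differential = each channel minus the common part.
    """
    common = binaural.mean(axis=0)      # shape (T,), shared by both channels
    differential = binaural - common    # shape (2, T), per-channel residual
    return common, differential


def merge_common_differential(common, differential):
    """Inverse of the split: add the common part back to each channel."""
    return common + differential


# Toy preliminary binaural signal: the right channel is a delayed,
# attenuated copy of the left one (hypothetical values).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
mono = np.sin(2 * np.pi * 440.0 * t)
left = mono
right = 0.8 * np.roll(mono, 8)          # 8-sample circular delay
binaural = np.stack([left, right])

common, differential = split_common_differential(binaural)
reconstructed = merge_common_differential(common, differential)
```

Under this assumption the decomposition is lossless, and the two differential channels are exact negatives of each other, so all interaural differences live in a single residual signal. In the proposed framework the two kinds of components are reconstructed separately and combined by the GCFM rather than by the plain addition used here.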