Yang Wenfei, Zhang Tianzhu, Zhang Yongdong, Wu Feng
IEEE Trans Image Process. 2021;30:3252-3262. doi: 10.1109/TIP.2021.3058614. Epub 2021 Mar 2.
Weakly supervised temporal sentence grounding offers better scalability and practicality than fully supervised methods in real-world application scenarios. However, most existing methods cannot model fine-grained video-text local correspondences well and lack effective supervision signals for correspondence learning, thus yielding unsatisfactory performance. To address these issues, we propose an end-to-end Local Correspondence Network (LCNet) for weakly supervised temporal sentence grounding. The proposed LCNet enjoys several merits. First, we represent video and text features in a hierarchical manner to model fine-grained video-text correspondences. Second, we design a self-supervised cycle-consistent loss that guides video-text matching. To the best of our knowledge, this is the first work to fully explore the fine-grained correspondences between video and text for temporal sentence grounding via self-supervised learning. Extensive experimental results on two benchmark datasets demonstrate that the proposed LCNet significantly outperforms existing weakly supervised methods.
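The abstract does not spell out the cycle-consistent loss, but a minimal sketch of the general idea, assuming soft attention between clip and word features in a shared embedding space, might look like the following (PyTorch; the function cycle_consistency_loss, the tensor shapes, and the cross-entropy formulation are illustrative assumptions, not LCNet's actual implementation):

import torch
import torch.nn.functional as F

def cycle_consistency_loss(video_feats, text_feats):
    # Illustrative assumption, not the paper's code.
    # video_feats: (T, d) clip-level features; text_feats: (N, d) word-level
    # features, both assumed projected into a shared embedding space.
    # Forward: each clip softly attends over the words.
    v2t = F.softmax(video_feats @ text_feats.t(), dim=-1)    # (T, N)
    attended = v2t @ text_feats                              # (T, d)
    # Backward: each attended text summary scores all clips again.
    back_logits = attended @ video_feats.t()                 # (T, T)
    # Cycle consistency: clip i's video -> text -> video round trip should
    # land back on clip i, i.e. the backward alignment approximates identity.
    target = torch.arange(video_feats.size(0), device=video_feats.device)
    return F.cross_entropy(back_logits, target)

loss = cycle_consistency_loss(torch.randn(20, 256), torch.randn(12, 256))

Minimizing such a loss requires no temporal annotations: the only supervision is the requirement that the round trip return to its starting clip, which is what makes the objective self-supervised.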