IEEE Trans Pattern Anal Mach Intell. 2017 Nov;39(11):2242-2255. doi: 10.1109/TPAMI.2016.2636827. Epub 2016 Dec 7.
In real-world visual recognition problems, low-level features cannot adequately characterize the semantic content of images or the spatio-temporal structure of videos. In this work, we encode objects or actions based on attributes that describe them as high-level concepts. We consider two types of attributes: human-specified attributes, and data-driven attributes extracted from data using dictionary learning methods. An attribute-based representation may exhibit variations due to noisy and redundant attributes. We therefore propose a discriminative and compact attribute-based representation, obtained by selecting a subset of discriminative attributes from a large attribute set. Three attribute selection criteria are proposed and formulated as a submodular optimization problem. A greedy optimization algorithm is presented, and its solution is guaranteed to be at least a (1-1/e)-approximation to the optimum. Experimental results on four public datasets demonstrate that the proposed attribute-based representation significantly boosts the performance of visual recognition and outperforms most recently proposed recognition approaches.
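The greedy selection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the three selection criteria, so a toy set-coverage objective (a standard monotone submodular function) stands in for them, and the names `greedy_select`, `coverage_objective`, and `COVERAGE` are hypothetical. The sketch also assumes a cardinality budget `k`, the classical setting in which greedy selection on a monotone submodular objective achieves the (1-1/e) bound.

```python
from typing import Callable, FrozenSet, List, Sequence


def greedy_select(
    candidates: Sequence[int],
    objective: Callable[[FrozenSet[int]], float],
    k: int,
) -> List[int]:
    """Pick up to k items, each round adding the one with the largest marginal gain."""
    selected: List[int] = []
    current_value = objective(frozenset())
    remaining = set(candidates)
    for _ in range(k):
        best_item, best_gain = None, 0.0
        # Iterate in sorted order so ties are broken deterministically.
        for item in sorted(remaining):
            gain = objective(frozenset(selected + [item])) - current_value
            if gain > best_gain:
                best_item, best_gain = item, gain
        if best_item is None:
            # No remaining item improves the objective; stop early.
            break
        selected.append(best_item)
        remaining.remove(best_item)
        current_value += best_gain
    return selected


# Hypothetical toy data (assumption): attribute index -> training samples it "covers".
COVERAGE = {
    0: {1, 2, 3},
    1: {3, 4},
    2: {1, 2},        # redundant with attribute 0
    3: {5, 6, 7, 8},
}


def coverage_objective(subset: FrozenSet[int]) -> float:
    """Stand-in objective (assumption): number of samples covered by the chosen attributes."""
    covered = set()
    for attribute in subset:
        covered |= COVERAGE[attribute]
    return float(len(covered))


chosen = greedy_select(list(COVERAGE), coverage_objective, k=2)
print(chosen)
```

In this toy setting the redundant attribute 2 is never chosen once attribute 0 is selected, because its marginal gain drops to zero. This mirrors the abstract's motivation of discarding redundant attributes, under the stated assumptions.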