The Product Search team works on visual fashion analysis for handbags and shoes. More specifically, we focus on the problems of branded handbag recognition and cross-scenario shoe retrieval. The researchers explore the prior knowledge of these two types of products and capture the fine details in the product images, in order to design search algorithms that perform favorably in recognizing and retrieving these products.
In this problem, our team focuses on branded handbags, such as Louis Vuitton and Coach. The challenges of branded handbag recognition are: 1) Inter-class style similarity. Some handbag styles are very similar, and only small decorations on local parts or subtle texture differences provide discriminability. 2) Intra-class color variation. Illumination changes enlarge the intra-class color variance among handbags of the same model number.
To deal with these two challenges and more effectively capture and differentiate handbag models, the group has developed several algorithms that build discriminative representations of handbag style and color: 1) a complementary feature that effectively captures texture details; 2) a novel supervised discriminative patch selection strategy; 3) a dominant color feature element selection method that handles illumination changes. The recognition pipeline is sequential: the handbag style is recognized first, followed by the specific model within the identified style. The following shows the schematic pipeline of our branded handbag recognition system.
To evaluate the proposed handbag recognition algorithm, we also established a branded handbag dataset with 5,545 images covering 220 different models. Each handbag image in our dataset is annotated with a bounding box covering the handbag surface. The dataset spans 125 different styles, and the number of models per style ranges from 1 to 13. Experimental results on this dataset show that our method performs favorably, achieving 90.86% recognition accuracy, around 10% higher than existing fine-grained and generic object recognition methods.
In this problem, our aim is to find exactly the same shoes in an online shop (shop scenario), given a daily shoe photo (street scenario). Designing such a system is highly challenging in three aspects: 1) cross-domain differences; 2) subtle differences in appearance; 3) viewpoint variation of the same shoe item. To deal with these challenges, we need to learn an efficient feature embedding that 1) reduces the feature distance between the same shoes in different scenarios, 2) distinguishes fine-grained differences, and 3) reduces ambiguous feature representations across different views.
We address the significant visual differences between the two scenarios with a triplet-based convolutional neural network (CNN). Specifically, we propose 1) a weighted triplet loss to reduce the feature distance between the same shoe in different scenarios; 2) an attribute-based hard example mining process to distinguish fine-grained differences; 3) a novel viewpoint-invariant loss to reduce ambiguous feature representations from different views.
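The triplet objective underlying the weighted loss above can be sketched as a standard hinge-style triplet loss with a per-triplet weight. This is an illustrative sketch, not the paper's exact formulation: the weight `w`, the margin value, and the squared-distance choice are all assumptions.

```python
import numpy as np

def weighted_triplet_loss(anchor, positive, negative, margin=0.3, w=1.0):
    """Hinge-style triplet loss on embedding vectors.

    Pulls the street-photo anchor toward its shop-photo positive and pushes
    it away from a negative shoe; `w` is an illustrative per-triplet weight
    standing in for the paper's weighting scheme.
    """
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return w * max(0.0, d_pos - d_neg + margin)
```

Attribute-based hard example mining would then select negatives that share semantic attributes (e.g. the same heel type or color) with the anchor, so the loss focuses on triplets whose fine-grained differences are hardest to separate.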
To evaluate the proposed algorithm, we collected a novel large-scale multi-view shoe dataset of online-offline image pairs for training and evaluation, comprising about 9,800 street-scenario and 31,050 online-shop images in multiple viewpoints with annotated semantic attributes. Experimental results on this dataset show that our method performs favorably in retrieving online-shop shoe images given a daily-life shoe photo, achieving 73.26% accuracy, a significant improvement over pre-trained CNN features, and they confirm the effectiveness of each step of our method. Moreover, qualitative results show that our method can handle query images with uncommon viewpoints, such as the rear view or an up-facing view.
[1] Yan Wang, Sheng Li, and Alex C. Kot, "On Branded Handbag Recognition," IEEE Transactions on Multimedia (TMM), 2016.
[2] Huijing Zhan, Boxin Shi, and Alex C. Kot, "Street-to-Shop Shoe Retrieval," British Machine Vision Conference (BMVC), 2017.