Complementary Human Parsing and Attention Guided Network for Person Re-Identification

Synopsis

This technology mines potentially discriminative information from the background as a complement to the foreground semantics of human body parts. By accurately identifying and tracking individuals, augmented reality applications can create interactive and immersive experiences.


Opportunity

We propose a complementary human parsing and attention-guided network (CHPAN) for person re-identification (re-ID), which leverages semantic information on human body parts from both foreground and background elements. In addition, our model mines salient region features at the global level using a newly designed lightweight self-attention module to improve identification accuracy. We introduce a new and effective human parsing feature module that extracts useful discriminative information from the background, suppressing noise and combining it with the foreground semantics to enhance the feature representation capability of the model. To our knowledge, this is the first method to exploit background semantics segmented by human parsing for person re-ID. The proposed methodology is useful in cases where there is insufficient information on the person for identification (e.g., due to occlusion), while ancillary information (e.g., a backpack) is available to assist better decisions.
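The core idea admits a compact sketch. Below is a minimal PyTorch illustration of complementary foreground/background pooling, assuming an off-the-shelf human parsing model supplies per-part and background confidence maps; the function names, shapes and the noise threshold are illustrative assumptions, not the released implementation.

```python
import torch

def complementary_pooling(feat, part_conf, bg_conf, noise_thresh=0.1):
    """Confidence-weighted pooling of foreground parts and background.

    feat:      (B, C, H, W) backbone feature map.
    part_conf: (B, P, H, W) confidence maps for P foreground body parts.
    bg_conf:   (B, 1, H, W) confidence map for the background class.
    Returns (B, P, C) foreground part features and a (B, C) background
    feature mined as a complement to the foreground semantics.
    """
    # Suppress low-confidence background responses so that only
    # discriminative background cues (e.g. a backpack) survive.
    bg_conf = torch.where(bg_conf > noise_thresh, bg_conf,
                          torch.zeros_like(bg_conf))

    def weighted_pool(conf):  # conf: (B, K, H, W)
        w = conf / (conf.sum(dim=(2, 3), keepdim=True) + 1e-6)
        return torch.einsum('bkhw,bchw->bkc', w, feat)

    fg_parts = weighted_pool(part_conf)          # (B, P, C)
    bg_feat = weighted_pool(bg_conf).squeeze(1)  # (B, C)
    return fg_parts, bg_feat
```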

 

Technology

  1. Pixel-level Human Part-Aligned Representation Branch
    We introduce a Human Parsing Feature Module (HPFM) tailored to acquiring local features of the human body. The module combines the global feature map generated at each stage with confidence maps predicted by a human parsing model. By integrating an HPFM after every stage of the network, we accommodate the varying input size of each stage, ensuring adaptability and robustness (see the first sketch after this list).
     
  2. Attention-Guided Global Feature Learning Branch
    Our proposed attention module comprises a channel-wise CONCAT attention module (CCAM) and a multi-scale spatial attention module (MSAM). CCAM explores correlations between channel features, while MSAM explores features with strong semantics in the spatial dimension (see the second sketch after this list).
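First sketch: how one HPFM instance can follow any backbone stage. The names below are hypothetical and the released module may differ; the point is that resizing the parsing confidence maps to each stage's spatial resolution lets a single design serve all stages.

```python
import torch
import torch.nn.functional as F

def hpfm_stage(stage_feat, conf_maps):
    """Part-aligned pooling for one backbone stage.

    stage_feat: (B, C_s, H_s, W_s) feature map from this stage.
    conf_maps:  (B, K, H, W) parsing confidence maps at the parser's
                native resolution (K = body parts + background).
    """
    # Bilinear resizing handles the varying spatial sizes
    # of the four backbone stages.
    conf_s = F.interpolate(conf_maps, size=stage_feat.shape[2:],
                           mode='bilinear', align_corners=False)
    w = conf_s / (conf_s.sum(dim=(2, 3), keepdim=True) + 1e-6)
    return torch.einsum('bkhw,bchw->bkc', w, stage_feat)  # (B, K, C_s)
```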

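Second sketch: the internals of CCAM and MSAM are not detailed above, so the following is a plausible minimal reading, assuming CCAM concatenates average- and max-pooled channel descriptors (hence CONCAT) and MSAM fuses spatial attention maps computed at several kernel sizes. Layer sizes and kernel choices are assumptions.

```python
import torch
import torch.nn as nn

class CCAM(nn.Module):
    """Channel attention from CONCATenated avg/max channel descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):                         # x: (B, C, H, W)
        avg = x.mean(dim=(2, 3))                  # (B, C)
        mx = x.amax(dim=(2, 3))                   # (B, C)
        w = self.fc(torch.cat([avg, mx], dim=1))  # channel correlations
        return x * w[:, :, None, None]

class MSAM(nn.Module):
    """Spatial attention fused over multiple receptive-field sizes."""
    def __init__(self, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(2, 1, k, padding=k // 2) for k in kernel_sizes)

    def forward(self, x):                         # x: (B, C, H, W)
        desc = torch.cat([x.mean(1, keepdim=True),
                          x.amax(1, keepdim=True)], dim=1)
        attn = torch.sigmoid(sum(conv(desc) for conv in self.convs))
        return x * attn
```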
 

Figure 1: Framework of our proposed model. The input image is processed by stacked convolutional layers within the backbone network, which consists of four stages, to form a 3D feature map. The attention-guided feature learning (AGFL) branch captures discriminative features at the global level using the self-attention module. The part-aware human parsing recognition (PHPR) branch utilises the Human Parsing Feature Module (HPFM) to obtain the semantic features of human body parts while simultaneously mining valuable visual cues hidden in the background. At the end of the network, a feature extractor (FE) unit extracts the final representation vector.

 

Figure 2: From left to right, each triplet contains the probe image, the Top-1 result of our model, and the Top-1 result of human semantic parsing for person re-identification (SPReID). These images indicate that our method is more robust in difficult situations, such as different viewpoints, illumination and shadows.

 

Applications & Advantages

This technique finds applications in various domains, including public safety (tracking suspects), retail (discovering shopping patterns), healthcare (uncovering outpatient movement patterns), and numerous other fields.

The novel HPFM is designed to mine salient features from the background while simultaneously leveraging foreground information. Additionally, another branch, guided by an attention module, identifies potential discriminative regions within images at the global level.

Inventor

Prof LIN Weisi