Person re-identification (Person Re-ID for short) is the problem of matching people across disjoint camera views in a multi-camera system. It is useful for a number of public security applications such as intelligent camera surveillance systems. In a typical real-world application, a single person, or a watch-list of a handful of known people, is provided as the target set for searching through a large volume of video surveillance footage in which the people on the watch-list are likely to re-appear. Given a person seen in one camera, the aim is to re-identify that person in another camera of the network based on the visual appearance of that individual, as shown in Figure 1.
Despite the increasing attention received from both academia and industry, person re-identification remains an extremely challenging task, especially in practical environments, for several reasons: (1) the target and the people in the search space appear under different views (frontal, side, back, etc.) because of the varying angles and distances between the cameras and the persons they observe; (2) the target person is usually captured at a very low frame rate, which is typical of most existing public-space CCTV footage; (3) the scenes are often very crowded, such as the exit of a subway station; and (4) there are many occlusions, so the target is visible only from time to time. In addition, (5) the human detection algorithm applied to the surveillance video may not perform perfectly, especially in real time, and a great number of non-human objects can be mistakenly detected, adding disruptive inputs to the Person Re-ID system. Finally, (6) real-world Person Re-ID is an open-set problem: there is an unlimited number of classes (persons with different identities), so typical classification methods trained on a fixed number of classes do not work.
To address these difficulties, many fully supervised Person Re-ID approaches have been proposed in recent years. The performance of fully supervised Person Re-ID has been much improved by using sophisticated training methods on a single labeled dataset. However, models trained on a single dataset usually suffer considerable performance degradation when applied to videos from a different camera network.
To be practical, a Person Re-ID model pre-trained on existing datasets should start running immediately after deployment at a new site, without having to wait until sufficient images or videos are collected and the pre-trained model is fine-tuned. To serve this purpose, we reformulate the Person Re-ID problem as a multi-dataset domain generalization problem. The NTU ROSE Lab, in collaboration with the University of Warwick, proposed a novel framework for domain generalization, which aims to learn a universal representation via domain-based adversarial learning while aligning the distribution of mid-level features across datasets. Our proposed framework can be considered an extension of our previous Multi-task Mid-level Feature Alignment (MMFA) network to a multiple-domain learning setting. We call it MMFA with Adversarial Auto-Encoder (MMFA-AAE).
Our MMFA-AAE simultaneously minimizes the data reconstruction, identity classification, and triplet verification losses. It alleviates domain differences via adversarial training and also matches the distribution of mid-level features across multiple datasets. Our MMFA-AAE approach not only outperforms most domain generalization Person Re-ID methods but also surpasses many state-of-the-art supervised and unsupervised domain adaptation methods by a large margin, as shown in Table 1.
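The combination of loss terms described above can be sketched in plain Python as follows. This is a minimal illustrative version only: the helper functions, loss weights, and the exact form of the adversarial term are assumptions for exposition, not the paper's precise formulation.

```python
import math

def mse(x, x_hat):
    """Auto-encoder reconstruction loss (mean squared error)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def cross_entropy(id_probs, label):
    """Identity classification loss on softmax probabilities."""
    return -math.log(id_probs[label])

def triplet(d_pos, d_neg, margin=0.3):
    """Triplet verification loss: the positive pair must be closer
    than the negative pair by at least `margin`."""
    return max(0.0, d_pos - d_neg + margin)

def domain_adversarial(domain_probs):
    """Adversarial term (illustrative): KL divergence of the domain
    discriminator's prediction from uniform. It is zero when the
    discriminator is maximally confused about the source dataset,
    which is what the feature extractor is trained towards."""
    k = len(domain_probs)
    return sum(p * math.log(k * p + 1e-12) for p in domain_probs)

def total_loss(x, x_hat, id_probs, id_label, d_pos, d_neg, domain_probs,
               w_rec=1.0, w_id=1.0, w_tri=1.0, w_adv=0.1):
    """Weighted sum of the four terms; the weights are hypothetical."""
    return (w_rec * mse(x, x_hat)
            + w_id * cross_entropy(id_probs, id_label)
            + w_tri * triplet(d_pos, d_neg)
            + w_adv * domain_adversarial(domain_probs))
```

For a well-trained example (perfect reconstruction, confident correct identity, positive pair much closer than negative, uniform domain prediction), every term is at or near its minimum and the total loss is small.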
Overall, unlike many other supervised or domain adaptation Person Re-ID models, MMFA-AAE can work on any unseen surveillance camera network without any additional training or fine-tuning. It provides a well-generalized feature representation with a usable performance for real-world surveillance applications.
The MMFA-AAE network is a collaborative work with the University of Warwick through the EU IDENTITY project (EU Horizon 2020 Marie Sklodowska-Curie Actions, project entitled Computer Vision Enabled Multimedia Forensics and People Identification, Project No. 690907, Acronym: IDENTITY). The IDENTITY project aims to integrate multimedia forensics into forensic science. Multimedia forensics is concerned with the development of scientific methods to extract, analyse and categorize digital evidence derived from multimedia sources, such as imaging devices. Examples include developing technologies to identify, categorise and classify the source of images and video, to authenticate and verify the integrity of their content, and to re-identify a person across cameras.
Based on the newly developed MMFA-AAE model, the ROSE Lab also developed a web-based AI-powered surveillance system that served as the demo for this project. The system is integrated with the 175 surveillance cameras in the NTU EEE building and processes and analyzes the video feeds in real time. It consists of two main functions: trajectory tracking retrieval and real-time person matching. Trajectory tracking retrieval aims to find the person of interest (POI) in all cameras and plot the historical movement trajectory of the POI within the EEE building, as shown in Figure 3. Real-time person matching aims to match the POI in live surveillance cameras and raise a warning to the surveillance officers, as shown in Figure 4.
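The real-time matching function above can be illustrated with a short sketch: the POI's feature vector is compared, by cosine similarity, against the features of every detection in each camera's current frame, and cameras whose best match exceeds a threshold trigger an alert. The function names, feature dimensions, and threshold here are hypothetical, not taken from the deployed system.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_poi(poi_feature, camera_detections, threshold=0.8):
    """Return (camera_id, score) for every camera containing a
    detection whose feature is similar enough to the POI.

    camera_detections maps a camera id to the list of feature
    vectors extracted from the people detected in its latest frame.
    """
    alerts = []
    for cam_id, features in camera_detections.items():
        best = max((cosine_similarity(poi_feature, f) for f in features),
                   default=-1.0)  # no detections -> no alert
        if best >= threshold:
            alerts.append((cam_id, best))
    return alerts
```

For example, a POI feature `[1.0, 0.0]` matched against two cameras, one containing a near-identical detection `[0.9, 0.1]` and one containing an orthogonal detection `[0.0, 1.0]`, raises an alert only for the first camera.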
Figure 3: Trajectory tracking retrieval example.
Figure 4: Real-time matching example.
The ROSE Re-ID system is based on the Flask micro web framework. It can be easily modified and integrated into any surveillance network using RTSP or HTTP video streams. Hence, during the COVID-19 pandemic, the system was modified and deployed in foreign worker isolation facilities to enhance security, as shown in Figure 5.
C. P. Tay, S. Roy, and K. H. Yap, “AANet: Attribute Attention Network for Person Re-Identifications,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7127–7136. (https://ieeexplore.ieee.org/document/8954103/)
F. Yang, K. Yan, S. Lu, H. Jia, X. Xie, and W. Gao, “Attention Driven Person Re-identification,” Pattern Recognit., 2019. (https://arxiv.org/abs/1810.05866)
S. Lin, H. Li, C.-T. Li, and A. C. Kot, “Multi-task Mid-level Feature Alignment Network for Unsupervised Cross-Dataset Person Re-Identification,” in Proc. British Machine Vision Conference (BMVC), 2018. (https://arxiv.org/abs/1807.01440)
J. Si, H. Zhang, C.-G. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang, “Dual Attention Matching Network for Context-Aware Feature Sequence Based Person Re-identification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5363–5372. (https://arxiv.org/abs/1803.09937)