Research Project

Análise Comportamental em Ambientes de Video-Vigilância Usando Arquiteturas de Aprendizagem Profunda (Behavioral Analysis in Video Surveillance Environments Using Deep Learning Architectures)

Publications

How Can Deep Learning Aid Human Behavior Analysis?
Publication. Roxo, Tiago Filipe Dias dos Santos; Proença, Hugo Pedro Martins Carriço; Inácio, Pedro Ricardo Morais
With the increase in available surveillance data and the robustness of state-of-the-art deep learning models, much recent research focuses on human biometric assessment, tracking, and person re-identification. However, one area that can combine surveillance and visual-based models and remains largely unexplored is the assessment of human behavior. The lack of work on this topic is not surprising given the inherent difficulty of categorizing human behavior in such conditions, in particular without subject cooperation. In the psychology literature, human behavior analysis typically requires controlled experimental environments, with subject cooperation and features assessed via grid-based surveys. As such, it is not clear how deep learning models can aid psychology experts in human behavior analysis, which is where this thesis intends to contribute to the body of knowledge. We extensively review the psychology literature to define a set of features that have been shown to influence human behavior and that can be assessed via camera in surveillance-like conditions. In this way, we define human behavior via subject profiling using seven behavioral features: interaction, relative position, clothing, soft biometrics, subject proximity, pose, and use of handheld devices. Note that this analysis does not categorize human behavior into specific states (e.g., aggressive, depressive) but rather creates a set of features that can be used to profile subjects, to aid/complement behavioral experts, and to compare behavioral traits between subjects in a scene. Furthermore, to motivate the development of work in these areas, we review state-of-the-art approaches and datasets, highlight the limitations of certain areas, and discuss topics worth exploring in future work.
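The seven-feature subject profile described above can be sketched as a simple per-subject record. This is an illustrative assumption of how such profiling might be structured; the field names and types are not the thesis's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch: one record per subject, covering the seven
# behavioral features named in the abstract. Names are assumptions.
@dataclass
class SubjectProfile:
    subject_id: int
    interaction: bool = False            # interacting with other subjects?
    relative_position: str = "unknown"   # e.g. "center" / "periphery" of scene
    clothing: list = field(default_factory=list)         # e.g. ["jacket", "hat"]
    soft_biometrics: dict = field(default_factory=dict)  # e.g. {"gender": "F"}
    proximity_m: Optional[float] = None  # distance to nearest subject (metres)
    pose: str = "unknown"                # e.g. "standing", "sitting"
    handheld_device: bool = False        # phone/tablet in hand?

# Comparing behavioral traits between two subjects in a scene:
a = SubjectProfile(1, interaction=True, pose="standing", handheld_device=True)
b = SubjectProfile(2, interaction=True, pose="sitting")
shared = [f for f in ("interaction", "pose", "handheld_device")
          if getattr(a, f) == getattr(b, f)]
print(shared)  # → ['interaction']
```

A flat record like this supports exactly the use the abstract names: comparing traits across subjects rather than labeling a behavioral state.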
After defining this set of behavioral features, we start by exploring the limitations of current biometric models in surveillance conditions, in particular the resilience of gender inference approaches. Using PAR datasets, we demonstrate that these models underperform on surveillance-like data, highlighting the limitations of training in cooperative settings when performing in wilder conditions. Supported by the findings of our initial experiments, complementing face and body information arose as a viable strategy to increase model robustness in these conditions, which led us to design and propose a new model for wild gender inference based on this premise. In this way, we extend the knowledge of an extensively discussed topic in the literature (gender classification) by exploring its application in settings where current models do not typically perform well (surveillance). We also explore the topic of human interaction, namely Active Speaker Detection (ASD), in particular in more uncooperative scenarios such as surveillance conditions. Contrary to the gender/biometrics topic, this is a less explored area, where works mainly assess active speakers via face and audio information in cooperative conditions with good audio and image quality (movie settings). As such, to clearly demonstrate the limitations of state-of-the-art ASD models, we start by creating a wilder ASD dataset (WASD), composed of different categories with increasing ASD challenges, namely audio and image quality degradation and uncooperative subjects. This dataset highlights the limitations of current models in unconstrained scenarios (e.g., surveillance conditions), while also showing the importance of body information in conditions where audio quality is subpar and face access is not guaranteed.
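The face/body complementarity premise can be illustrated with a minimal late-fusion sketch: weight the face branch by its estimated quality so the body branch dominates when the face is occluded or low-resolution, as is common in surveillance footage. The fusion rule and names here are hypothetical, not the thesis's actual model:

```python
def fuse_gender_scores(face_prob, body_prob, face_quality):
    """Hypothetical quality-weighted late fusion of two branch
    confidences (both in [0, 1]). Purely illustrative."""
    w = max(0.0, min(1.0, face_quality))  # clamp quality to [0, 1]
    return w * face_prob + (1.0 - w) * body_prob

# With a clear face, the face branch dominates:
print(round(fuse_gender_scores(0.9, 0.6, face_quality=0.8), 2))  # → 0.84
# With a heavily degraded face, the body branch dominates:
print(round(fuse_gender_scores(0.5, 0.7, face_quality=0.1), 2))  # → 0.68
```

The point of the sketch is only the direction of the effect: as face quality drops, the fused score tracks the body branch, mirroring the abstract's argument for complementing face with body information in wild conditions.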
Following this premise, we design the first model that complements audio, face, and body information to achieve state-of-the-art performance in challenging conditions, in particular surveillance settings. Furthermore, this model proposes a novel way to combine data via Squeeze-and-Excitation (SE) blocks, which provides reasoning for the model's decisions through visual interpretability. The use of SE blocks was also extended to other models and ASD-related areas to highlight the viability of this approach for model-agnostic interpretability. Although this initial model was superior to the state of the art on challenging data, its performance in cooperative settings was not as robust. As such, we develop a new model that simultaneously combines face and body information during visual data extraction which, in conjunction with pretraining on challenging data, leads to state-of-the-art performance in both cooperative and challenging conditions (such as surveillance settings). These works pave a new way to assess human interaction in more challenging data and with model interpretability, serving as baselines for future work.
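As a rough illustration of the SE-block mechanism the model builds on (squeeze each channel to its global average, excite through a small gated bottleneck, then rescale the channels), here is a minimal pure-Python sketch. The weights are illustrative constants, not learned parameters, and the sketch is not the thesis's architecture:

```python
import math

def se_gate(channel_maps, w1, w2):
    """Minimal squeeze-and-excitation sketch over flattened channel maps.
    Returns the rescaled maps and the per-channel gates s, which double
    as importance scores for interpretability."""
    # Squeeze: global average per channel.
    z = [sum(c) / len(c) for c in channel_maps]
    # Excitation: bottleneck (ReLU) then expand back (sigmoid).
    h = [max(0.0, sum(zi * w for zi, w in zip(z, row))) for row in w1]
    s = [1 / (1 + math.exp(-sum(hi * w for hi, w in zip(h, row)))) for row in w2]
    # Scale: gate each channel map by its excitation weight.
    return [[v * si for v in c] for c, si in zip(channel_maps, s)], s

# Two channels of 2 values each; illustrative (untrained) weights.
gated, s = se_gate([[1.0, 3.0], [2.0, 2.0]],
                   w1=[[0.5, 0.5]],        # 2 channels -> 1 bottleneck unit
                   w2=[[0.0], [0.0]])      # 1 unit -> 2 channel gates
print(s)      # → [0.5, 0.5]
print(gated)  # → [[0.5, 1.5], [1.0, 1.0]]
```

With zero expansion weights the sigmoid gates sit at 0.5 for every channel; trained weights would instead push individual gates toward 0 or 1, and inspecting those gates is what enables the kind of visual interpretability the abstract describes.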

Keywords

Exact sciences; Exact sciences/Computer and information sciences

Funders

Funding agency

Fundação para a Ciência e a Tecnologia, I.P.

Funding programme

OE

Funding Award Number

2020.09847.BD