Given a sequence of video frames, we use pose estimation to extract the keypoints of every person in each frame
and use a pose tracker to track the skeletons across the
frames. Eventually, each person in a clip is represented as a
temporal pose graph. Our network maps the training samples into a Gaussian-distributed latent space and calculates
the probability of a human pose sequence occurring based on the training data.
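
As a rough illustration of how a flow-based model can score a pose sequence, the sketch below applies the change-of-variables formula with a single elementwise affine map standing in for the network. The function name `affine_flow_log_prob`, the tensor shape `(T, J, 2)`, and the parameters `log_scale` and `shift` are assumptions made for this example; they are not taken from the actual architecture, which is not described here.

```python
import numpy as np

def affine_flow_log_prob(x, log_scale, shift):
    """Log-density of x when z = (x - shift) * exp(-log_scale) and z ~ N(0, I).

    A single elementwise affine map is a stand-in (assumption) for the real network.
    """
    z = (x - shift) * np.exp(-log_scale)
    # log N(z; 0, I), summed over every latent dimension
    log_pz = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))
    # log|det dz/dx| of an elementwise affine map is minus the sum of the log-scales
    log_det = -np.sum(log_scale)
    return log_pz + log_det

# Toy pose sequence (hypothetical sizes): T frames, J keypoints, (x, y) coordinates
rng = np.random.default_rng(0)
T, J = 24, 18
pose_seq = rng.normal(size=(T, J, 2))

# Identity flow parameters, used here only to keep the example easy to check
log_scale = np.zeros((T, J, 2))
shift = np.zeros((T, J, 2))

normal_score = affine_flow_log_prob(pose_seq, log_scale, shift)
outlier_score = affine_flow_log_prob(pose_seq + 5.0, log_scale, shift)
```

A sequence that lies far from the training distribution, such as the shifted one above, receives a lower log-likelihood and can therefore be flagged as anomalous.
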
We demonstrate our algorithm in two settings. The first
is the widely used ShanghaiTech Campus dataset. In
this setting, the training data consists of only normal video
samples, and the test data consists of both normal and abnormal videos.
The second setting is supervised anomaly
detection, using the recent synthetic UBnormal dataset,
which consists of both normal and abnormal training data.
For this setting, we use our suggested normalizing flows
model with a Gaussian Mixture Model prior. This forces
the network to assign low probabilities to known abnormal
samples.
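
One possible reading of the Gaussian Mixture Model prior is sketched below: each label gets its own latent Gaussian component, training pushes a sample toward the component that matches its label, and scoring uses only the normal component. The component means, the unit variance, the latent dimension, and the function names `supervised_nll` and `normality_score` are hypothetical choices for this example, not values from the method itself.

```python
import numpy as np

def gaussian_log_prob(z, mean):
    """Log-density of z under an isotropic, unit-variance Gaussian centered at mean."""
    d = z.size
    return -0.5 * np.sum((z - mean) ** 2) - 0.5 * d * np.log(2.0 * np.pi)

dim = 8                           # hypothetical latent dimension
mu_normal = np.full(dim, 3.0)     # hypothetical component mean for normal samples
mu_abnormal = np.full(dim, -3.0)  # hypothetical component mean for abnormal samples

def supervised_nll(z, is_abnormal):
    """Training loss (assumed form): negative log-likelihood under the labeled component."""
    mean = mu_abnormal if is_abnormal else mu_normal
    return -gaussian_log_prob(z, mean)

def normality_score(z):
    """Test-time score (assumed form): log-likelihood under the normal component only."""
    return gaussian_log_prob(z, mu_normal)

# A latent code sitting at the abnormal mean scores far lower than one at the normal mean
score_for_normal = normality_score(mu_normal.copy())
score_for_abnormal = normality_score(mu_abnormal.copy())
```

Because abnormal training samples are pulled toward a separate component, their latent codes end up far from the normal mean and receive a low normality score.
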
Extensive experiments show that our model outperforms
the previous pose-based and appearance-based state-of-the-art
methods for both settings. In addition, the ablation study
shows our method is robust to noise and can generalize over
different datasets. We show that when training on synthetic
data and evaluating on real data, our model’s performance
degrades only slightly, despite the considerable difference in appearance.
The ShanghaiTech Campus dataset is one of the largest datasets for video anomaly detection, containing videos from 13 cameras around the ShanghaiTech University campus. It consists of 330 training videos with only normal events and 107 test videos with both normal and abnormal events, annotated at both frame and pixel levels. Examples of human anomalies in the dataset include running, fighting, and riding bikes. The videos contain various people in each scene, with challenging lighting and camera angles.
The UBnormal dataset is a new synthetic supervised open-set benchmark containing both normal and abnormal actions in the training videos. It contains 268 training videos, 64 validation videos, and 211 test videos, and is also annotated at both frame and pixel levels. Some scenes in the dataset are foggy or set at night. The pose detector overcame these difficult conditions and accurately estimated the poses in such scenes. This provides additional evidence for the advantages of working with a non-appearance-based model, which can focus on learning actions and disregard the illumination or background of a scene.
@article{hirschorn2022human,
title = {Normalizing Flows for Human Pose Anomaly Detection},
author = {Hirschorn, Or and Avidan, Shai},
journal = {arXiv preprint arXiv:2211.10946},
year = {2022},
}