Comparative Survey of Deep Learning Architectures for Video Anomaly Detection: CNNs, Autoencoders, VAEs, RNNs, GANs, and Hybrids

Wei Wang

Abstract


This comparative survey examines deep learning architectures for video anomaly detection (VAD), including CNNs, autoencoders (AEs), variational autoencoders (VAEs), recurrent models (RNN/LSTM), GANs, and hybrids. We review more than 60 studies on standard benchmarks such as UCSD Ped1/2, CUHK Avenue, ShanghaiTech, UMN, and Subway, using unified measures such as frame-level AUC, F1, and precision/recall, together with computational characteristics such as inference latency and compute cost. Representative reported findings include AE-based methods reaching AUCs of roughly 0.92–0.98 on UCSD variants, a ConvLSTM-VAE reaching AUC = 0.965, and prediction-/hybrid-based models achieving strong AUCs on UCSD/Avenue/ShanghaiTech. We synthesize evidence on robustness to occlusion and domain shift, the effect of temporal modeling (e.g., ConvLSTM-AE vs. static AE), latent-space modeling in VAEs (single Gaussian vs. mixture of Gaussians), the trade-offs of adversarial training (reconstruction vs. adversarial loss, mode collapse), and hybrid designs (e.g., CNN+RNN, AE+memory). We highlight open problems around large-scale standardized datasets and cross-scene generalization, and provide a selection matrix to help practitioners choose a model under compute, latency, and data constraints.
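To make the evaluation protocol concrete, the sketch below illustrates how a frame-level anomaly score is commonly derived from an autoencoder's reconstruction error and then evaluated with frame-level AUC. This is a generic illustration under stated assumptions, not code from any surveyed method; model, frames, and labels are hypothetical placeholders.

# Minimal sketch (illustrative only): frame-level anomaly scoring from
# reconstruction error and frame-level AUC evaluation for VAD.
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_scores(model, frames):
    """Anomaly score per frame = mean squared reconstruction error."""
    recon = model.predict(frames)                      # assumed: model returns reconstructions of shape (N, H, W, C)
    err = (frames - recon) ** 2
    return err.reshape(len(frames), -1).mean(axis=1)   # one scalar score per frame

def frame_level_auc(scores, labels):
    """Frame-level AUC: higher scores should align with anomalous frames (label 1)."""
    # Min-max normalization of scores, a common step in VAD evaluation
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    return roc_auc_score(labels, s)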



DOI: https://doi.org/10.31449/inf.v49i29.9376

This work is licensed under a Creative Commons Attribution 3.0 License.