An Efficient Abnormal Event Detection System in Video Surveillance Using Deep Learning-Based Reconfigurable Autoencoder

ABSTRACT


INTRODUCTION
Digital cameras with improved features and increased memory, made possible largely by recent advances in chip design, have replaced the original analog cameras. These advances have also lowered the cost of the ICs used in digital cameras, making them widely affordable. This affordability has increased their use for surveillance by private businesses such as retail shops and malls and by public services such as museums, banks, and airports. We live in a time where video surveillance is used extensively to ensure public safety and to reduce crime. As a result, surveillance cameras generate a huge amount of data. As per the data in [1], the video data generated by surveillance cameras installed on or before 2016 is about 566 GB per day. The same source estimates that, with the increasing use of surveillance cameras, the video data produced will increase to 3500 GB per day by 2023. Assessing this volume of video manually is impractical. Intelligent surveillance technology is therefore necessary for video comprehension and for the organized processing and analysis of this data. One of the core components of an efficient automatic AI monitoring system is abnormal event detection, which identifies, in real time, a small number of anomalous occurrences among the predominantly typical occurrences in large volumes of surveillance video.
Abnormal event detection in videos using deep learning involves training models to identify and flag events that deviate from normal patterns or behaviors within a given video stream, as shown in Figure 1. It is commonly applied in fields such as surveillance, industrial monitoring, and healthcare.
Abnormal event detection has taken centre stage in much of the major research on pattern recognition and computer vision. The variety of scenes that can be classified as abnormal events is large, and this is a challenge: defining parameters that encompass the whole range of potential anomalous events is difficult. To interpret abnormal events statistically, it is typical to classify them as low-probability occurrences compared to normal events. Unexpected events, and events that are out of the ordinary and not in line with the usual daily activities present in most normal samples, can be treated as abnormal events. The pattern recognition and computer vision approaches currently used for identifying abnormal events comprise two parts [2][3][4]. The first part is event representation, the data-gathering stage in which the features of the video scene are extracted; the features extracted depend on the type of model used. For example, the features can be image-level or object-level. Image-level features concentrate on the pixel values of the image. Features of the two-dimensional image pixels or the three-dimensional video pixels are used to generate image gradients or histograms, for example the spatiotemporal gradient (STG) [5] and histograms of optical flow (HOF) [6][7][8][9], or to derive image textures through the mixture of dynamic textures (MDT) [10]. Object-level features instead track the objects in the image, such as their trajectory [11] or their appearance and any change to it [12], in order to indicate an event, as in sports images, history images, or energy images.
The second part is anomaly detection. This part generally consists of a model designed to detect an abnormal event from anomalies in the features gathered in the first part. Anomaly detection is usually done through a set of rules that establish what a normal event looks like; any test event that does not follow these rules is treated as an anomaly, or abnormal event. The most popular anomaly detection models today are cluster-based detection models [13], models based on state inference [14], and models based on sparse reconstruction [15,16]. The cluster-based detection model groups similar normal events into clusters; test samples that are very dissimilar to these clusters are classified as abnormal events. The state inference model works on the presumption that a normal event changes in a consistent way over time, whereas an anomaly does not. The sparse reconstruction model makes use of the sparse representation of the two-dimensional images or three-dimensional videos and their features: the sparse representation of a normal event will have a small reconstruction error compared to that of an anomaly.
Related topics in image processing, computer vision, and pattern recognition that use deep learning, such as object recognition [17,18], object detection [19], behaviour recognition [20], and health diagnosis, have been successful. This is mainly due to the joint optimization of the two parts within the deep learning technique. Because of this joint optimization, a model can discover abnormality across a wide variety of classes and is no longer restricted to sensing abnormality within a specific class of normal events. Deep learning techniques have produced accurate predictions on these related topics, and researchers have therefore applied them to abnormal event detection [21][22][23][24][25]. In the study by Luo et al. [26], a model that uses three autoencoders for three different channels is proposed. Each autoencoder learns the appropriate features in its respective channel. A support vector machine (SVM) then predicts an anomaly score for each channel, and the scores are combined to derive the final decision on anomaly detection. A similar autoencoder method is presented by Xia et al. [27], where the abnormal event is detected through a sparse autoencoder and its reconstruction error. Similarly, Xiong et al. [28] make use of two contrasting autoencoders, one fully connected and the other fully convolutional. Together, these two autoencoders learn the occurrence of normal events with respect to time, and abnormal events are then detected through the variation of the reconstruction error over time. A major drawback of the above methods is that any sample that does not coincide with the parameters of a normal event is considered an anomaly, without taking into consideration the low probability of occurrence of an anomaly. This leads to false detections, with samples that are not abnormal also being flagged as anomalies.
While deep learning has shown great promise in various computer vision tasks, including abnormal event detection in videos, there are disadvantages and challenges associated with using deep learning methods for this purpose [29,30]. The key disadvantages are as follows:

Data and annotation challenges
Deep learning models require large amounts of labeled data for training. Annotated datasets for abnormal event detection in videos can be scarce and expensive to create. Annotating anomalies in videos often requires domain expertise and can be subjective, leading to inconsistencies in annotations [31].

Model complexity and overfitting
Deep learning models, especially complex ones, can be prone to overfitting. Anomalies are by definition rare events, making it challenging to train models that can generalize well to unseen anomalies without overfitting to normal behaviors.

Computational resource demands
Training deep learning models for video analysis requires significant computational resources, including high-performance GPUs and substantial memory. This can lead to high infrastructure costs, making it challenging for smaller research groups or organizations with limited resources to adopt these methods effectively.

Need for large-scale data
Anomalies come in various forms, and training a model to recognize them requires exposure to a wide range of abnormal events. Acquiring and annotating such a diverse dataset can be difficult and time-consuming.

Interpretable results
Deep learning models are often seen as black boxes, making it challenging to understand and interpret the decisions they make. In applications like abnormal event detection, interpretability is crucial for understanding why a certain event was flagged as abnormal, especially in critical domains like surveillance and healthcare.

Transferability to new environments
Models trained on one set of videos may not generalize well to different environments or camera setups due to variations in lighting, camera angles, and scene layouts. Fine-tuning or retraining the model on new data is often necessary, which can be resource-intensive.

Adversarial attacks
Deep learning models, including those used for abnormal event detection, can be vulnerable to adversarial attacks. Malicious actors can manipulate videos or scenes to deceive the model into misclassifying normal or abnormal events.

Lack of sufficient anomaly samples
In some cases, acquiring sufficient samples of rare anomalies for training might be practically impossible. This can result in models that struggle to identify anomalies that are not well represented in the training data.

Real-time processing
Many applications of abnormal event detection require real-time or near-real-time processing. Achieving low-latency predictions with deep learning models can be challenging and may require specialized hardware or optimization techniques.

Model robustness
Deep learning models can be sensitive to noise, variations, and minor changes in the input data. Ensuring that a model remains robust in dynamic real-world scenarios is a constant challenge.
Despite these challenges, researchers and practitioners continue to work on improving deep learning methods for abnormal event detection in videos. Techniques like transfer learning, data augmentation, explainable AI, and adversarial training are being explored to mitigate some of these disadvantages and improve the reliability and generalizability of models.
The proposed method is built on a reconfigurable autoencoder (RAE) [32][33][34][35], which, unlike existing methods for anomaly detection, provides an end-to-end deep learning framework. The RAE framework maps the raw input data to a low-dimensional hidden-layer representation. A Gaussian distribution then constrains this hidden-layer representation such that the probability of a normal event under the distribution is much higher than that of an anomaly [36]. Representing the raw data in the hidden layers and using the Gaussian distribution for anomaly detection together carry out the two main parts of the proposed model, event representation and anomaly detection. The two parts are optimized jointly within a single learning framework, which allows the model to perform in generalized scenarios. The robustness and generalization of the proposed method are assessed on two popular open-source data sets. Besides being applicable to different scenarios, the quality of its anomaly detection is on par with current technologies.

RAE FOR ANOMALY DETECTION
The proposed method of anomaly detection using the RAE consists of a training phase and a testing phase. The training phase first collects dense samples from the three-dimensional video input in the spatial domain. The raw, manipulated pixel values are input to the RAE model [37]. These high-dimensional values are mapped to the hidden layers of the RAE model in order to derive their Gaussian distribution representation. During the testing phase, the Gaussian representation of the test sample is obtained from the hidden layers of the RAE model [38,39]. This representation yields an anomaly score for the test sample. Finally, all test samples with an anomaly score below the threshold are classified as abnormal events. This section elaborates the proposed variational autoencoder method for abnormal event detection in surveillance videos. The working of an autoencoder is discussed first.

Working of the autoencoder
The input to the autoencoder is converted and stored in the form of its hidden-layer representation [40,41]. The original input can be recovered from this representation through the composition of the encoder and decoder. The expressions for the encoder f and decoder g are shown in Eq. (1) and Eq. (2):

$$ z = f(x; \theta_1) \tag{1} $$

$$ x' = g(z; \theta_2) \tag{2} $$

If x is the input data, then z is the hidden-layer representation of the data and x′ is the input data reconstructed by the autoencoder. The terms $\theta_1$ and $\theta_2$ represent the hidden-layer parameters of the neural network. The reconstruction error has to be minimized in order to reconstruct the raw input data effectively [42]. The reconstruction error is represented by the expression in Eq. (3):

$$ \mathcal{L}_{rec}(x, x') = \lVert x - x' \rVert^2 \tag{3} $$

The hidden-layer representation of the input data can also be regarded as the features of the data. Hence this representation can be used in the pattern recognition stage of the abnormal event detection model. To improve the feature extraction ability of the hidden layer, the denoising autoencoder [43] was designed; it introduces a noise term that improves the expressiveness of the model and yields better feature extraction. Similarly, the sparse autoencoder [44] introduces sparsity constraints, which keep the model effective even under partial loss of data. Error-based reconstruction [45][46][47] has also been widely used in autoencoders. The hidden-layer representation is sometimes used directly as a feature and has been successful in event detection. One major drawback of all the above methods is that they do not take into consideration the probability of the different samples: abnormal events have a very low probability compared to normal events, and this is not reflected in the representation. Hence, to take the probability of the different events into account, the proposed method assumes that the hidden-layer representation follows a Gaussian distribution [48,49]. The abnormal event in the surveillance video is then detected through a variational autoencoder.
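The encoder, decoder, and reconstruction error described in this subsection can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' network: the single-layer architecture, the `tanh` activation, the random initialisation, and the names `theta1`, `theta2`, `encode`, and `decode` are all assumptions made for the example. The dimensions 500 and 30 are borrowed from the experimental setup purely for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n_in, n_hidden):
    """Hypothetical single-layer parameters (assumed, not from the paper)."""
    theta1 = {"W": rng.normal(0.0, 0.1, (n_hidden, n_in)), "b": np.zeros(n_hidden)}
    theta2 = {"W": rng.normal(0.0, 0.1, (n_in, n_hidden)), "b": np.zeros(n_in)}
    return theta1, theta2

def encode(x, theta1):
    """Encoder f: maps the input x to its hidden-layer representation z."""
    return np.tanh(theta1["W"] @ x + theta1["b"])

def decode(z, theta2):
    """Decoder g: maps the hidden-layer representation z back to x'."""
    return theta2["W"] @ z + theta2["b"]

def reconstruction_error(x, x_rec):
    """Squared Euclidean distance between the input and its reconstruction."""
    return float(np.sum((x - x_rec) ** 2))

theta1, theta2 = init_params(n_in=500, n_hidden=30)
x = rng.random(500)           # one normalised input vector
z = encode(x, theta1)         # hidden-layer representation
x_rec = decode(z, theta2)     # reconstruction x'
err = reconstruction_error(x, x_rec)
```

With untrained parameters the error is large; training would adjust `theta1` and `theta2` to minimise it over the normal samples.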

Abnormal event detection through variational autoencoder
A number N of samples of size S are used to train the variational autoencoder [50][51][52] to identify the Gaussian distribution within the input data. Assuming that the training samples are all samples of normal events, the Gaussian distributions of these normal events cluster together around one cluster centre. If the input data represents an abnormal event, the Gaussian distribution of that event will have a centre far from the cluster centre of the training data. An event whose Gaussian distribution centre is far from the cluster centre of the training data is classified as an abnormal sample, or abnormal event, as given in Eq. (4):

$$ z \sim \mathcal{N}(0, I) \tag{4} $$

The reconstruction of the original data by the variational autoencoder is very similar to the reconstruction process of the autoencoder. The hidden-layer representation of the input data has to satisfy the condition in Eq. (4), where I represents the identity matrix. The variational autoencoder comprises two neural networks, which make up the encoder and decoder of the hidden layer. The loss function of the variational autoencoder is given in Eq. (5).

$$ L(\theta, \phi, x) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) \tag{5} $$

This loss function regulates the reconstruction and ensures that it is effective. The sampling for each sample in the training data follows the Monte Carlo method, as given in Eq. (6), where the training data $x_i$ is represented through its hidden-layer representation $z_i$.
The first term on the RHS of Eq. (5) is the log-likelihood of the training data, necessary for effective reconstruction by the decoder. The second term is the Kullback-Leibler divergence between the two probability distributions $q_\phi(z|x)$ and $p(z)$; this divergence is small when the two distributions are similar. Based on this estimation, $q_\phi(z|x)$ takes the form of a normal distribution, given in Eq. (7). From Eq. (4), $p(z)$ reduces to the standard normal distribution, given in Eq. (8). From Eq. (7) and Eq. (8), the second term on the RHS of Eq. (5) can be expressed as in Eq. (9) for effective abnormal event detection.
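For reference, the quantities named in Eqs. (6) to (9) commonly take the following forms in the standard variational autoencoder formulation. This is a sketch under assumptions that the text does not state explicitly: a diagonal Gaussian posterior whose mean $\mu(x)$ and standard deviation $\sigma(x)$ are produced by the encoder, $L$ Monte Carlo samples per training point, and a hidden-layer dimension $d$. The matching of these forms to the equation numbers is inferred from the surrounding description and may differ from the authors' exact expressions.

```latex
% (6) Monte Carlo estimate of the reconstruction term for training sample x_i
\mathbb{E}_{z \sim q_\phi(z|x_i)}\left[\log p_\theta(x_i|z)\right]
  \approx \frac{1}{L}\sum_{l=1}^{L} \log p_\theta\left(x_i \mid z_{i,l}\right),
  \qquad z_{i,l} = \mu(x_i) + \sigma(x_i) \odot \epsilon_l,\quad
  \epsilon_l \sim \mathcal{N}(0, I)

% (7) approximate posterior produced by the encoder
q_\phi(z|x) = \mathcal{N}\left(z;\ \mu(x),\ \mathrm{diag}\left(\sigma^2(x)\right)\right)

% (8) prior on the hidden-layer representation, from Eq. (4)
p(z) = \mathcal{N}(z;\ 0,\ I)

% (9) closed form of the second RHS term of Eq. (5)
-D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right)
  = \frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
```

The closed form in the last line follows directly from the two Gaussians in the preceding lines, so no sampling is needed for the divergence term.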

Prediction
After training, the test data is input to the network. The features of the test sample are identified and represented in the form of the hidden-layer representation z′ through the inferred network. The probability of z′ under the Gaussian distribution is given in Eq. (10).
If the test sample is a normal event, it will lie in the high-probability region of the Gaussian distribution, while an abnormal event will have a lower value. Hence the occurrence of an abnormal event is inferred through a threshold, as given in Eq. (11).
The sensitivity of the detection procedure is represented by δ.
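The prediction step can be illustrated with a short NumPy sketch. It assumes that z′ is scored by its density under the standard normal distribution of Eq. (4) and that the comparison with δ is done in the log domain for numerical stability; both choices, as well as the function names and the value of `log_delta`, are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def log_prob_standard_normal(z):
    """Log-density of hidden-layer codes under N(0, I); z has shape (n, d)."""
    z = np.atleast_2d(np.asarray(z, dtype=float))
    d = z.shape[1]
    return -0.5 * (d * np.log(2.0 * np.pi) + np.sum(z ** 2, axis=1))

def is_abnormal(z, log_delta):
    """Flag samples whose log-density falls below the log-domain threshold."""
    return log_prob_standard_normal(z) < log_delta

z_test = np.array([
    [0.1, -0.2, 0.0],   # close to the origin: high density
    [4.0, 5.0, -6.0],   # far from the origin: low density
])
log_delta = -10.0       # hypothetical sensitivity threshold (log of delta)
flags = is_abnormal(z_test, log_delta)
```

Here the first code is kept as normal and the second is flagged as abnormal. Raising `log_delta` makes the detector more sensitive, at the cost of more false alarms.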

EXPERIMENT
The results of the proposed abnormal event detection model are compared with various existing methods on two data sets, UCSD Ped1 [53] and Avenue [54].

Data set and its indicators for experimental analysis
The first data set is the UCSD Ped1 data set, which has 36 normal and 38 abnormal samples. The samples are taken by a surveillance camera fixed at the pavement; the frames are of size 238×158, with 220 frames per clip. In this data set, people walking on the pavement are classified as normal events, while small motor vehicles, bicycles, and small cars on the pavement, and people walking on the lawn, are classified as abnormal events.
The second data set is the Avenue data set, which is used for monitoring students and other people around school premises. Here the normal events include people walking, while abnormal events include people running, hanging around during class hours, throwing things, misbehaving, etc. The data set comprises 30652 frames of size 360×240. Both data sets are captured by a surveillance camera whose lens is positioned obliquely. Sample images from the two data sets are shown in Figure 2; Figures 2(a) and 2(b) show the UCSD Ped1 dataset and the Avenue dataset, respectively. The images at the top are normal events, while the images at the bottom are abnormal events.
Two types of indicators, frame-level and pixel-level [55], are used for evaluation:
(1) Frame-level indicators: if one or more abnormal pixels exist in a frame, the frame is classified as abnormal.
(2) Pixel-level indicators: if the detected abnormal area of the sample coincides with at least 40% of the real abnormal area, the event is classified as abnormal.
Once the samples are classified as normal or abnormal, the true positives and false positives of the classified samples are estimated. Then, by varying the threshold δ [56], the ROC curves in Figure 3 are plotted; Figures 3(a) and 3(b) show the frame-level ROC and pixel-level ROC, respectively.
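The two indicators and the ROC construction can be sketched as follows. The sketch assumes that each sample has an anomaly score for which larger values mean more anomalous (for example, the negative log-probability of its hidden-layer representation), so sweeping a threshold over the scores plays the role of varying δ. The function names and the toy data are illustrative assumptions, and the area under the curve is computed with the trapezoidal rule.

```python
import numpy as np

def frame_is_abnormal(pixel_mask):
    """Frame-level rule: one or more abnormal pixels marks the frame abnormal."""
    return bool(np.any(pixel_mask))

def pixel_level_hit(pred_mask, true_mask, min_overlap=0.4):
    """Pixel-level rule: detection covers at least 40% of the true abnormal area."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    true_mask = np.asarray(true_mask, dtype=bool)
    true_area = true_mask.sum()
    if true_area == 0:
        return False
    return bool((pred_mask & true_mask).sum() / true_area >= min_overlap)

def roc_curve_from_scores(scores, labels):
    """Sweep a threshold over the scores; labels are 1 (abnormal) or 0 (normal)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tpr = np.array([((scores >= t) & (labels == 1)).sum() / n_pos for t in thresholds])
    fpr = np.array([((scores >= t) & (labels == 0)).sum() / n_neg for t in thresholds])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example: two abnormal samples score higher than two normal ones.
fpr, tpr = roc_curve_from_scores([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
area = auc(fpr, tpr)
```

In this toy case the scores separate the two classes perfectly, so the area under the curve is 1.0.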

Setup for training the proposed model and subsequent test data estimation
The proposed abnormal event detection model comprises 4 hidden layers; the initial layer comprises 500 neurons, while the subsequent layers comprise 2500 and 30 neurons each. This symmetrical network structure is trained using normalized input vectors of size 500×1. Each vector is derived from the samples of size 160×120, with the video clips divided into cuboids of size 10×10×5. The Adam optimizer [57] is used to train the network. The network uses a stepped learning rate that starts from 0.0001 and reduces by 1/10 every thousand iterations. The maximum number of iterations is set to 10000. For evaluation, the surveillance video is converted into test samples of size 10×10×1. These three-dimensional samples are divided in such a way that they are unique and do not overlap in time. The evaluation begins with the hidden-layer representation of the test sample. Through Eq. (10), it is determined whether the sample represents a normal or an abnormal event. The neural network is implemented using TensorFlow and Python on an NVIDIA GTX 1070 Ti with 8 GB of memory. The results of the proposed method are compared with current existing methods [58][59][60][61][62][63][64][65]. These are denoted 'sparse reconstruction cost (SRC)', 'Markov chain Monte Carlo (MCMC)', 'reconstruction error of autoencoder (REAE)', and 'channel state information (CSI)', respectively.
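The input preparation and learning-rate schedule described above can be sketched as follows. Several details are assumptions made for the example: frames are taken to be grayscale arrays of height 120 and width 160, normalisation is taken to mean division by 255, and "reduces by 1/10" is read as multiplying the rate by 0.1 every 1000 iterations. The helper names `extract_cuboids` and `learning_rate` are hypothetical.

```python
import numpy as np

def extract_cuboids(video, size=(10, 10, 5)):
    """Split a (T, H, W) clip into non-overlapping h x w x t cuboids.

    Returns an array of shape (n_cuboids, h * w * t) scaled to [0, 1].
    """
    h, w, t = size
    T, H, W = video.shape
    # Drop any remainder so that the clip divides evenly into cuboids.
    T, H, W = (T // t) * t, (H // h) * h, (W // w) * w
    v = video[:T, :H, :W].astype(np.float32)
    v = v.reshape(T // t, t, H // h, h, W // w, w)
    v = v.transpose(0, 2, 4, 1, 3, 5)      # (nT, nH, nW, t, h, w)
    return v.reshape(-1, t * h * w) / 255.0

def learning_rate(step, base_lr=1e-4, drop=0.1, every=1000):
    """Stepped schedule: multiply the rate by `drop` every `every` iterations."""
    return base_lr * drop ** (step // every)

# Synthetic 10-frame clip of 160x120 frames, stored as (T, H, W) = (10, 120, 160).
clip = np.random.default_rng(0).integers(0, 256, size=(10, 120, 160)).astype(np.uint8)
vectors = extract_cuboids(clip)            # each row is one 500-dimensional input
```

Each 10×10×5 cuboid flattens to exactly 500 values, which matches the stated input vector size of 500×1.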

Results and analysis
The proposed method is compared with four popular existing methods, three of which use separate, sequential feature extraction and event detection procedures. The type of features extracted varies by model: in SRC, the dynamic textures of the frames are used as features; in the second method, the features are the variations in the pixel gradients with respect to the spatial and temporal domains; and in the third method, an autoencoder is used for feature extraction. These methods follow a two-step procedure in which, once the features of an image have been extracted, a subsequent abnormal event detection model identifies the abnormal event. The first method uses inference on image statistics for abnormal event detection, the second uses a sparse model for reconstruction and abnormal event detection, and the third uses a one-class SVM for abnormal event detection. The results of the proposed and existing methods are compared through ROC curves. The ROC values of the existing methods are taken from the corresponding papers; the CSI paper does not provide ROC curve data. Figure 3 shows the frame-level and pixel-level comparison of the ROC curves for the different methods.
From Figure 3, it can be seen that the proposed method excels in frame-level evaluation compared with the existing methods. In pixel-level evaluation, the proposed method is on par with the existing MCMC, REAE, and SRC methods. The UCSD Ped1 data set is used for this assessment, and the results are tabulated in Table 1. Figure 4 shows the comparison between the proposed method and the existing methods on UCSD Ped1. The proposed method achieves 92.3% frame-level AUC and 71.4% pixel-level AUC. The second method, which uses temporal gradient values, performs much worse than the proposed method; this may be due to each frame acting as a separate input to its neural network. The frame-level evaluation values for the Avenue data set are shown in Table 2. The proposed method is compared with MCMC and CSI and is empirically shown to perform better than both. Figure 5 shows the comparison of the Avenue dataset's frame-level AUC% with the conventional methods.
The generalization of the proposed model is also tested. From Figure 6, it is seen that the proposed model can successfully detect various abnormal events, such as a trolley or a skateboard, demonstrating its good generalization ability. Examples of results from partially accurate detection on the two datasets are shown in Figure 6: the test results from the UCSD Ped1 dataset are (a) and (b), and those from the Avenue dataset are (c) and (d).
Finally, the proposed model is tested in terms of the speed of abnormal event detection; the results are tabulated in Table 3.

CONCLUSION
In this paper, a reconfigurable autoencoder (RAE) based on end-to-end deep learning is proposed to distinguish normal and abnormal events in video surveillance. As per the analysis, the probability of an abnormal event under the Gaussian distribution is small compared to that of a normal event. The proposed method reduces efficiency compared to the simulation parameters to get better performance. As per the simulation results, the proposed method performs better than the conventional methods with respect to AUC(%). On the UCSD Ped1 dataset, the proposed method shows an improvement of 1.02% (frame-level AUC%) and 1.03% (pixel-level AUC%) with respect to the Avenue data set's frame-level AUC(%). On the Avenue dataset, the proposed method works better than the MCMC method, with a 1.01% improvement in frame-level AUC(%). Additionally, the two phases are optimised concurrently to raise the method's precision and generalizability. The numerical results on the two public datasets demonstrate that the proposed approach reaches the current state of the art. Future research will consider applying the proposed method to increasingly complex datasets.

Figure 1. Fundamental flow chart of abnormal event detection in video using deep learning

1. The encoder network $q_\phi(z|x)$: a probabilistic encoder that maps the training data to its hidden-layer representation through an approximate posterior distribution $q_\phi(z|x)$. This network is called the inferred network.
2. The decoder network $p_\theta(x|z)$: a generative decoder that reconstructs the original training data x′ from its hidden-layer representation, without the use of any input prior.
These two networks are bound by the parameters θ and φ, which are responsible for the hidden-layer representation z of the training data by the encoder and for the reconstruction of the training data x′ from this representation.
Figure 2. Several examples of occurrences from the datasets for detecting anomalous events: (a) samples from the UCSD Ped1 dataset; (b) examples from the Avenue dataset

Figure 3. ROC curves: (a) frame-level ROC; (b) pixel-level ROC

Figure 4. Comparison analysis between UCSD Ped1 and the existing method

Figure 5. Comparison of the Avenue dataset's frame-level AUC% with the conventional methods

The speed test (Table 3) uses the UCSD Ped1 data set. The proposed method has the fastest detection rate, 571 fps, in comparison with the existing methods. The hardware used for this empirical analysis is an Intel Core i7-8700K 3.7 GHz CPU, an NVIDIA GTX 1070 Ti GPU with 8 GB of video memory, and 16 GB of RAM. The software used is Python 3.7 and TensorFlow 1.7.

Table 1. AUC% comparison of the UCSD Ped1 dataset with the conventional methods

Table 2. Comparison of the Avenue dataset's frame-level AUC% with the conventional methods

Table 3. Running time comparison on the UCSD Ped1 dataset

Figure 6. Illustrations of the detection results