© 2018 IIETA. This article is published by IIETA and is licensed under the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).
The gesture segmentation of mobile terminals faces two major problems: the segmentation effect is constrained by complex backgrounds, and the timeliness is dampened by the limited resources of mobile devices. To solve these problems, this paper puts forward a detection method based on improved mixed Gaussian background modelling. The first step is to improve the distribution of the mixed Gaussian background model: the number of Gaussian distributions was controlled adaptively to reduce the system computing and storage loads. Then, the learning rate was controlled in light of a special scene change rate, aiming to enhance the adaptability of gesture segmentation to environmental changes. The experimental results show that the proposed method can rapidly eliminate environmental interferences, achieve effective hand segmentation and deliver good computing performance, despite massive changes to the background scene.
mixed Gaussian model, background modelling, learning rate, gesture segmentation
The precision of gesture recognition depends heavily on accurate gesture segmentation (Chen et al., 2017), which in turn calls for precise detection of hand movement. Currently, the most popular detection methods for moving objects include the time difference method, the optical flow method and the background subtraction method (Plyer et al., 2016; Cui et al., 2017). Among them, only the background subtraction method applies to object detection on mobile terminals. Background subtraction mainly falls into mean background modelling, single Gaussian background modelling and mixed Gaussian background modelling (Zheng et al., 2016; Goyal and Singhai, 2017; Mangal and Kumar, 2016). Mixed Gaussian background modelling, which represents each pixel with multiple Gaussian distributions, stands out for its ability to accurately simulate a background with a multi-peak distribution in real time.
For mobile terminal applications, gesture segmentation based on mixed Gaussian background modelling still has the following disadvantages: (1) the processing is slow due to the limited computing power and storage of the mobile terminal; (2) the detection effect is degraded by the virtual shadows easily produced by dynamic changes of the background scene. To overcome these defects, Katsarakis et al. (2016) accelerated the elimination of virtual shadows by controlling the spatial change of the learning rate, which pushed up the mean learning rate per frame and increased the computing load. Azzam et al. (2016) put forward a spatially global mixed Gaussian model based on RGB pixels. Huang et al. (2015) presented a new moving object detection method based on pixel-based temporal and spatial information. However, all these methods require substantial system storage.
In light of the above, this paper attempts to improve the mixed Gaussian background modelling method from two aspects. On the one hand, the number of Gaussian distributions was selected adaptively for each pixel, so as to reduce the consumption of mobile computing resources without sacrificing the detection effect; on the other hand, a special scene change rate was defined to control the learning rate and accelerate the elimination of the virtual shadows produced at massive changes of the background.
2.1. Construction of background model
Let X={X_{1}, X_{2}, X_{3},…,X_{t}} be the pixels of video frames collected in chronological order, with X_{t} being the pixel sample of the video frame collected at time t. The probability of observing the current pixel value X_{t} is:
$P\left(X_{t}\right)=\sum_{i=1}^{K} \omega_{i, t} \mathrm{H}\left(X_{t}, \mu_{i, t}, \phi_{i, t}^{2}\right)$ (1)
where K is the number of Gaussian distributions in the mixed Gaussian model; ω_{i,t} is the weight of the ith Gaussian distribution in the mixed Gaussian model at time t; and H is the Gaussian probability density function:
$H\left(X_{t}, \mu_{i, t}, \phi_{i, t}^{2}\right)=\frac{1}{\sqrt{(2 \pi)^{n}}\left|\phi_{i, t}\right|} e^{-\frac{1}{2}\left(X_{t}-\mu_{i, t}\right)^{T} \phi_{i, t}^{-2}\left(X_{t}-\mu_{i, t}\right)}$ (2)
$-\frac{1}{2}\left(X_{t}-\mu_{i, t}\right)^{T} \phi_{i, t}^{-2}\left(X_{t}-\mu_{i, t}\right)=-\frac{\left(X_{t}-\mu_{i, t}\right)^{T}\left(X_{t}-\mu_{i, t}\right)}{2 \phi_{i, t}^{2}}=-\frac{\left(X_{t}-\mu_{i, t}\right)^{2}}{2 \phi_{i, t}^{2}}$ (3)
According to (2) and (3), the probability density function of each mixed Gaussian distribution can be expressed as:
$H\left(X_{t}, \mu_{i, t}, \phi_{i, t}^{2}\right)=\frac{1}{\sqrt{(2 \pi)^{n}}\left|\phi_{i, t}\right|} e^{-\frac{\left(X_{t}-\mu_{i, t}\right)^{2}}{2 \phi_{i, t}^{2}}}$ (4)
where $\mu_{i, t}$ and $\phi_{i, t}^{2}$ are respectively the mean and the covariance matrix of the ith Gaussian distribution at time t; ω_{i,t} is the weight representing the similarity ratio between the sample value of the current distribution and the mixed model of image X.
Each new pixel X_{t} should be compared against the K existing Gaussian distributions according to equation (5) below:
$\left|X_{t}-\mu_{i, t-1}\right| \leq 2.5 \sigma_{i, t-1}$ (5)
If the new pixel deviates from the mean of a Gaussian distribution by no more than 2.5σ, the pixel matches that distribution. If the matched Gaussian distribution satisfies the background requirement, the new pixel belongs to the background; otherwise, it belongs to the foreground.
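The matching test in equation (5) can be sketched as follows. This is a minimal single-channel sketch (scalar greyscale pixels); the function name is illustrative:

```python
def match_gaussian(pixel, means, sigmas):
    """Return the index of the first Gaussian distribution whose mean is
    within 2.5 standard deviations of the new pixel (equation (5)),
    or -1 if no distribution matches."""
    for i, (mu, sigma) in enumerate(zip(means, sigmas)):
        if abs(pixel - mu) <= 2.5 * sigma:
            return i
    return -1
```

A pixel close to a distribution's mean returns that distribution's index; a pixel far from every mean returns -1 and is treated as a new observation.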
2.2. Update to the background model
The traditional mixed Gaussian background modelling often mistakes the massive changes of the background for the foreground. The resulting virtual shadows will dampen the modelling accuracy and the effect of foreground extraction.
If the new pixel X_{t} matches the ith Gaussian distribution at time t-1, the weight, mean and covariance matrix should be updated by the following equations (Jian and Wang, 2014):
$\omega_{i, t}=(1-\alpha) \omega_{i, t-1}+\alpha$ (6)
$\mu_{i, t}=(1-\beta) \mu_{i, t-1}+\beta X_{t}$ (7)
$\sigma_{i, t}^{2}=(1-\beta) \sigma_{i, t-1}^{2}+\beta\left(X_{t}-\mu_{i, t}\right)^{T}\left(X_{t}-\mu_{i, t}\right)$ (8)
where α is the learning rate; β is the ratio of the learning rate to the weight.
If the new pixel X_{t} fails to match any Gaussian distribution, the distribution with the smallest weight should be replaced by a new one, while the means and variances of the remaining distributions stay unchanged; the weights should be updated by the following equation:
$\omega_{i, t}=(1-\alpha) \omega_{i, t-1}$ (9)
After each update, all weights should be normalized such that the total weight equals 1, and all Gaussian distributions should be ranked again by priority ρ_{i,t}. Here, ρ_{i,t} refers to the priority of the ith Gaussian distribution in the mixed Gaussian background model at time t.
$\rho_{i, t}=\frac{\omega_{i, t}}{\sigma_{i, t}}$ (10)
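The update rules (6)-(9) and the renormalization step can be written as a minimal single-channel sketch. Here β is taken as the learning rate divided by the updated weight, following the definition of β as the ratio of the learning rate to the weight; the function names are illustrative:

```python
def update_matched(w, mu, var, x, alpha):
    """Update the matched Gaussian: weight by equation (6), then mean
    and variance by equations (7) and (8), with beta = alpha / w."""
    w_new = (1 - alpha) * w + alpha
    beta = alpha / w_new
    mu_new = (1 - beta) * mu + beta * x
    var_new = (1 - beta) * var + beta * (x - mu_new) ** 2
    return w_new, mu_new, var_new

def decay_unmatched(w, alpha):
    """Weight decay for distributions that did not match (equation (9))."""
    return (1 - alpha) * w

def normalize(weights):
    """Renormalize all weights so the total equals 1."""
    total = sum(weights)
    return [w / total for w in weights]
```

After decaying every unmatched weight by equation (9), `normalize` restores the unit total before the distributions are re-ranked by the priority of equation (10).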
Many Gaussian distributions are generated during the computing process. Considering the limited resources of mobile terminals, the redundant Gaussian distributions should be removed. Thus, this paper proposes to scan all Gaussian distributions of each pixel at the interval of f frames. If a pixel has more than 3 Gaussian distributions, the distribution with the lowest priority should be deleted. Then, the weight of each Gaussian distribution should be checked. If the current weight is smaller than the initial weight and the priority is lower than the initial priority, the Gaussian distribution should be determined as redundant and deleted.
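This pruning pass can be sketched as follows, assuming each distribution is stored as a (weight, sigma) pair; the initial-weight and initial-priority thresholds here are illustrative placeholders, not values from the paper:

```python
def prune_gaussians(gaussians, max_k=3, w_init=0.05, rho_init=0.01):
    """Keep at most max_k distributions per pixel, ranked by priority
    rho = w / sigma (equation (10)); then delete any distribution whose
    weight fell below the initial weight AND whose priority fell below
    the initial priority. Distributions are (weight, sigma) pairs."""
    ranked = sorted(gaussians, key=lambda g: g[0] / g[1], reverse=True)
    ranked = ranked[:max_k]  # drop the lowest-priority extras
    return [(w, s) for (w, s) in ranked
            if not (w < w_init and w / s < rho_init)]
```

Running this scan only every f frames keeps the per-frame cost low while still bounding the number of Gaussians stored per pixel.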
2.3. Foreground segmentation
After all Gaussian distributions are ranked by priority ρ_{i,t} at time t, the top B distributions should be selected for further analysis. Then, each pixel value X_{t} is matched against the top B Gaussian distributions at time t. If the pixel value X_{t} matches one of the top B Gaussian distributions, the pixel is a background point; if it fails to match any of them, the pixel is a foreground point, i.e. part of a moving object. B should satisfy the following equation:
$B=\arg \left(\min _{b}\left(\sum_{i=1}^{b} \omega_{i, t}>T\right)\right)$ (11)
where T is the proportion of the background.
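Equation (11) picks the smallest b whose cumulative weight exceeds T. A small sketch, assuming the weights are already sorted by priority (the value of T here is an illustrative placeholder):

```python
def background_count(sorted_weights, T=0.7):
    """Smallest b such that the cumulative weight of the top-b
    (priority-ranked) distributions exceeds the background
    proportion T (equation (11))."""
    total = 0.0
    for b, w in enumerate(sorted_weights, start=1):
        total += w
        if total > T:
            return b
    return len(sorted_weights)
```

A larger T admits more distributions into the background model, which tolerates more multi-modal backgrounds at the cost of absorbing slow-moving foreground objects.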
When the complex background undergoes massive change, the system may mistake part of the background for the foreground and extract it as a moving object. The resulting virtual shadows will dampen the effect of gesture segmentation (Lopez-Rubio et al., 2015). The virtual shadows can be removed rapidly by increasing the learning rate. However, if the learning rate is fixed at a large value α_{1}, the moving object will be treated as part of the background if it does not move for a short time.
In view of this, a special scene change rate γ is designed here. If the scene change rate γ_{t} is above the threshold U, the learning rate should be increased to the large value α_{1}; if the scene change rate γ_{t} is below the threshold U, the learning rate should be reduced to the smaller value α_{0}.
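This two-level control can be sketched as follows; the threshold U and the two rates α_1 and α_0 are illustrative placeholders, since the paper does not fix their values in this passage:

```python
def select_learning_rate(gamma_t, U=1.5, alpha_1=0.05, alpha_0=0.005):
    """Two-level learning rate driven by the scene change rate gamma_t:
    the large rate alpha_1 (fast virtual-shadow elimination) when
    gamma_t exceeds the threshold U, otherwise the small rate alpha_0."""
    return alpha_1 if gamma_t > U else alpha_0
```

The large rate is applied only while the scene change rate is high, so a briefly stationary hand is not absorbed into the background as it would be under a permanently large rate.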
Figure 1 shows the experimental results of the original and improved foreground extraction algorithms. Figure 1(a) presents a partial image of the video frame of the FGNet database. It can be seen that the hand was recognized as part of the background when it did not move in a short time. Figure 1(b) illustrates the extracted foreground when the learning rate was fixed at a large value α_{1}. It is clear that the hand almost disappeared from the foreground. Figure 1(c) is the foreground extracted by the improved method.
Figure 1. Experimental results of foreground extraction.
A great variation in the mean greyscale of the image reveals violent changes of scene in the video frame at time t, while a small variation in the skin colour of the image indicates limited movement of the moving object, i.e. the hand, in the foreground. In this case, the great changes must come from the background of the frame image. Thus, the scene change rate can be calculated as:
$\gamma_{\mathrm{t}}=\frac{R_{t}}{S_{t}}$ (12)
where R_{t} and S_{t} are respectively the mean change rate of grayscale and skin colour of the image. The mean grayscale change rate R_{t} can be obtained as:
$R_{t}=\frac{\left|h_{t}-h_{t-1}\right|}{h_{t-1}}$ (13)
where h_{t} is the mean greyscale of the image at time t. The mean skin colour change rate S_{t} can be obtained as:
$S_{t}=\frac{\left|H_{t}-\delta\right|-\left|H_{t-1}-\delta\right|}{\left|H_{t-1}-\delta\right|}$ (14)
where δ is a preset parameter; H_{t} is the mean H (hue) of the image at time t. HSV (hue, saturation, value) is a colour space that reflects the intuitive features of colour and is widely used for skin colour detection. The H (hue) component carries the colour information of the image and reacts relatively slowly to changes in illumination. Besides, the H of skin colour generally falls between 22 and 28 (Figure 2).
Figure 2. Concentrated area of the H of skin colour in HSV colour space.
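Equations (12)-(14) can be sketched with NumPy as follows. The value of δ here is an illustrative placeholder near the 22-28 skin-hue band, and the absolute value taken at the end is a guard added in this sketch so that γ_t stays non-negative:

```python
import numpy as np

def scene_change_rate(gray_t, gray_prev, hue_t, hue_prev, delta=25.0):
    """Compute gamma_t = R_t / S_t (equation (12)) from two consecutive
    frames: R_t is the mean greyscale change rate (equation (13)) and
    S_t the mean skin-colour change rate (equation (14))."""
    h_t, h_prev = float(np.mean(gray_t)), float(np.mean(gray_prev))
    R = abs(h_t - h_prev) / h_prev
    H_t, H_prev = float(np.mean(hue_t)), float(np.mean(hue_prev))
    S = (abs(H_t - delta) - abs(H_prev - delta)) / abs(H_prev - delta)
    # Guard: use |S| so gamma_t is non-negative and comparable to U.
    return R / abs(S) if S != 0 else float("inf")
```

A large γ_t means the mean greyscale changed far more than the skin-hue statistic did, which is exactly the signature of a background change rather than hand motion.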
Firstly, the video was inputted to collect the frame images. After that, these images were preprocessed, and the background was simulated by the improved mixed Gaussian background modelling method. The modelling was followed by rapid and accurate extraction of the foreground, and the segmentation of the gesture.
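The flow above can be sketched at a high level as follows. The model object and all of its method names are illustrative assumptions used to show the order of the steps, not an API from the paper:

```python
def segment_gestures(frames, model):
    """Run the full pipeline on a frame sequence: preprocess, update the
    improved mixed Gaussian background model with a scene-change-driven
    learning rate, extract the foreground, and segment the hand."""
    masks = []
    for frame in frames:
        frame = model.preprocess(frame)            # e.g. smoothing
        alpha = model.select_learning_rate(frame)  # two-level rate
        model.update_background(frame, alpha)      # equations (6)-(10)
        fg = model.extract_foreground(frame)       # equation (11) test
        masks.append(model.segment_hand(fg))       # final gesture mask
    return masks
```

Each stage corresponds to one subsection above: the background update implements equations (6)-(10), the foreground test implements equation (11), and the learning-rate selection follows the scene change rate of equation (12).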
4.1. Dataset collection
Two types of experimental data were collected from two sources, namely, a video from the FGNet database and a video taken by the author with a fixed camera. In the former video, the hand colour of the model was similar to the desk colour, and the objects on the desk were changed constantly (i.e. the video background is constantly changing). In the latter video, the background was complex and underwent massive changes, which can reflect the improvement effect.
4.2. Adaptive selection of the number of Gaussian distributions
Table 1 lists the mean number of frames in the two videos processed per second by the original algorithm and the improved algorithm. It is clear that the improved algorithm was much faster than the original one and reduced the computing load of the mobile terminals.
Table 1. Processing speeds of the original and improved algorithms (fps)

Video source          Original algorithm    Improved algorithm
FGNet video           2.36                  8.20
The author's video    3.13                  10.42
4.3. Controlling the learning rate based on special scene change rate
Figure 3. Adaptabilities of the original and improved algorithms.
This subsection further compares the foreground extraction effects of the original algorithm and the improved algorithm. The original video frame is shown in Figure 3(a), and the effects of the original and improved algorithms are displayed in Figures 3(b) and 3(c), respectively. As shown in Figure 3, when another person suddenly entered the background, the original algorithm incorrectly extracted the person into the foreground, while the improved algorithm eliminated the virtual shadows rapidly. Hence, the original algorithm cannot adapt to background changes, while the improved algorithm has strong adaptability to massive changes of the background.
The original and improved algorithms were further contrasted on three parameters: precision, recall and F1-measure. The precision and recall evaluate the quality of the extraction results, while the F1-measure is the harmonic mean of the precision and recall, reflecting the overall performance of the extraction.
High precision indicates that the correct rate of the detection is high, and high recall means a high proportion of gestures are correctly detected (i.e. few gestures are missed). The F1-measure cannot reach a high value unless both parameters are at a high level. As shown in Table 2 below, the proposed algorithm enjoys a high F1-measure.
Table 2. Precisions, recalls and F1-measures of the original and improved algorithms (%)

Source                Parameter     Original algorithm    Improved algorithm
FGNet video           Precision     76.65                 89.39
                      Recall        30.62                 75.20
                      F1-measure    44.91                 81.33
The author's video    Precision     98.03                 97.25
                      Recall        48.62                 76.13
                      F1-measure    63.86                 85.33
For rapid gesture segmentation on mobile terminals, this paper proposes a fast, improved mixed Gaussian background modelling method. The proposed method modifies the traditional mixed Gaussian background modelling through optimization of the learning rate and the number of Gaussian distributions. Experiments prove that the improved algorithm adapts well to massive changes of the background and works effectively in gesture segmentation.
This work was supported in part by the National Natural Science Foundation of China under Grants No. 61672461 and No. 61672463.
Azzam R., Kemouche M. S., Aouf N., Richardson M. (2016). Efficient visual object detection with spatially global Gaussian mixture models and uncertainties. Journal of Visual Communication and Image Representation, Vol. 36, No. 1, pp. 90-106. https://doi.org/10.1016/j.jvcir.2015.11.009
Chen D., Li G., Sun Y. (2017). An interactive image segmentation method in hand gesture recognition. Sensors, Vol. 17, No. 2, pp. 539-550. https://doi.org/10.3390/s17020253
Cui Z. G., Wang H., Li A. H. (2017). Moving object detection based on optical flow field analysis in dynamic scenes. Acta Physica Sinica, Vol. 66, No. 8, pp. 56-63. https://doi.org/10.7498/aps.66.084203
Goyal K., Singhai J. (2017). Review of background subtraction methods using Gaussian mixture model for video surveillance systems. Artificial Intelligence Review, Vol. 2, No. 1, pp. 246-252. https://doi.org/10.1007/s10462-017-9542-x
Huang W., Liu L., Yue C. (2015). The moving target detection algorithm based on the improved visual background extraction. Infrared Physics and Technology, Vol. 71, No. 1, pp. 518-525. https://doi.org/10.1016/j.infrared.2015.06.011
Jian C. F., Wang Y. (2014). Batch task scheduling-oriented optimization modelling and simulation in cloud manufacturing. International Journal of Simulation Modelling, Vol. 13, No. 1, pp. 93-101.
Katsarakis N., Pnevmatikakis A., Tan Z. H. (2016). Improved Gaussian mixture models for adaptive foreground segmentation. Wireless Personal Communications, Vol. 87, No. 3, pp. 1-15. https://doi.org/10.1007/s11277-015-2628-3
Lopez-Rubio F. J., Dominguez E., Palomo E. J., López-Rubio E., Baena R. M. L. (2015). Selecting the color space for self-organizing map based foreground detection in video. Neural Processing Letters, Vol. 43, No. 2, pp. 345-361. https://doi.org/10.1007/s11063-015-9431-8
Mangal S., Kumar A. (2016). Real time moving object detection for video surveillance based on improved GMM. International Journal of Advanced Technology & Engineering Exploration, Vol. 4, No. 26, pp. 17-22.
Plyer A., Besnerais G. L., Champagnat F. (2016). Massively parallel Lucas-Kanade optical flow for real-time video processing applications. Journal of Real-Time Image Processing, Vol. 11, No. 4, pp. 713-730. https://doi.org/10.1007/s11554-014-0423-0
Zheng A., Zhang L., Zhang W. (2016). Local-to-global background modeling for moving object detection from non-static cameras. Multimedia Tools and Applications, Vol. 76, No. 8, pp. 11003-11019. https://doi.org/10.1007/s11042-016-3565-1