OPEN ACCESS
The kmeans clustering (KMC) algorithm easily falls into the local optimum trap, if the initial cluster centers are not reasonable. To solve the problem, this paper puts forward a dynamic swarm firefly algorithm based on chaos theory and maxmin distance algorithm (FCMM). Firstly, the number of cluster centers, k, and their positions were determined by the maximin distance algorithm (MM). Then, a chaotic space was constructed by tent map based on the cluster centers, and the cluster centers were updated through chaotic search. In this way, the initial cluster centers are no longer concentrated in a small area, and the algorithm is not prone to the local optimum trap. The simulation results show that the FCMM successfully avoids the local optimum trap, achieves efficient classification of the initial dataset, and converges to the global optimal solution in an accurate manner.
kmeans clustering (KMC), maxmin distance algorithm (MM), firefly algorithm (FA), chaos theory
With the dawn of the big data era, great progress has been made in the field of data mining. As a classical method of data mining and analysis [1, 2], cluster analysis enjoys a huge application potential in pattern recognition, indoor positioning and statistics. The improvement of clustering accuracy has long been a research hotspot, aiming to satisfy the growing demand for data accuracy [3, 4].
Since the traditional kmeans clustering (KMC) algorithm is easily affected by the initial cluster centers and abnormal data [5, 6], many scholars have attempted to optimize the initial cluster centers of the traditional KMC in the light of feature correlation [7, 8] and develop clustering algorithms based on adaptive weight [9, 10]. The firefly algorithm (FA), a stochastic optimization algorithm mimicking the firefly behaviors, has been widely adopted to optimize the KMC for its ability to improve the accuracy of clustering algorithms. The FA is highly robust, easy to operate and supports parallel processing [11, 12], providing an effective solution to various optimization problems.
Considering the fact that the KMC, relying heavily on the initial cluster centers, is prone to fall into the local optimum trap, Pan et al. [13] proposed a firefly partitional clustering algorithm based on adaptive step length, which avoids the local optimum trap by replacing the fixed step length of the KMC with adaptive step length. Wang et al. designed a novel niche firefly partitional clustering algorithm to enhance population diversity [14]. Chen et al. [15] introduced a weighted Euclidean distance to optimize the initial cluster centers of the KMC, utilizing the strong global search ability and easy implementation feature of the FA.
The above FAoptimized KMC algorithms can achieve better clustering effect when the cluster centers are given. However, none of them clearly defines how to determine the number of cluster centers, k. The firefly partitional clustering algorithm based on adaptive step length has a small step length in the late stage of optimization, which easily causes slow convergence. What is worse, the dataset cannot jump out of the current cluster center, thus reducing the clustering accuracy.
To solve the above problems, this paper puts forward a dynamic swarm FA based on chaos theory and maxmin distance algorithm (FCMM). Firstly, the maxmin distance algorithm (MM) was adopted to determine the number of cluster centers, k, and the positions of the initial cluster centers. Then, the chaotic tent map, featuring uniform ergodicity and fast iteration, was employed to set up a chaotic search space with the initial cluster centers as the reference points. Next, the initial cluster centers were optimized by chaotic tent search, such that the algorithm can jump out of the local optimum trap and converge to the global optimum rapidly. Finally, the position update formula of the FA was used to allocate the sample points other than the cluster centers to suitable clusters, putting an end to the clustering process.
2.1 The KMC
The essence of the KMC is to classify a given dataset $X=\left\{X_{1}, X_{2}, \ldots, X_{n}\right\}$ into k classes $\left\{C_{1}, C_{2}, \ldots, C_{k}\right\}$. The classic KMC contains the following steps: First, selecting k objects randomly from the dataset X and taking them the initial cluster centers $C_{j}(j=1,2, \ldots, k)$ of k classes; Then, computing the Euclidean distance is computed between each remaining object $X_{i}(i=1,2, \dots, n)$ and each cluster center, and allocating each remaining object to the nearest class $C_{j}$; Next, recalculating the mean value of all objects in each class, and taking it as the new cluster center [16]. The above steps are repeated until the cluster center of each class no longer changes.
Definition 1. Euclidean distance
The Euclidean distance is the linear distance between two points in Euclidean space [17]. In an mdimensional space, the Euclidean distance between the samples $X_i$ and $X_j$ can be expressed as:
$d\left(X_{i}, X_{j}\right)=\sqrt{\sum_{k=1}^{m}\left(X_{i k}X_{j k}\right)^{2}}$ (1)
The KMC determines the similarity between samples based on the Euclidean distance.
2.2 The FA
Inspired by firefly motions, the FA computes the objective function value and relative attractiveness of each firefly according to its position and light intensity. The lighter fireflies move towards the brighter ones, using the position update formula. The moving distance depends on the attractiveness. The optimization of the FA adheres to the following three principles:
(1) Any two fireflies can attract each other, regardless of their gender.
(2) The attractiveness of a firefly is negatively correlated with its distance to another firefly and positively correlated with its light intensity. The lighter fireflies are attracted by and move towards the brighter ones, while the brightest firefly moves randomly.
(3) The light intensity of a firefly depends on the objective function value at its position.
Definition 2. The light intensity I of each firefly can be defined as:
$I \proptof\left(x_{i}\right), 1 \leq i \leq n$ (2)
$I=I_{0} \exp \left(\gamma * r_{i j}\right)$ (3)
where, $f\left(x_{i}\right)$ is the objective function; $\mathcal{X}_{i}$ is the spatial position of firefly i ; $I_{0}$ is the maximum light intensity; γ is a constant representing the light intensity absorption; $r_{i j}$ is the Euclidean distance between $X_{i}$ and $X_{j}$.
Definition 3. The attractiveness of each firefly can be defined as:
$\beta=\beta_{0} \exp \left(\gamma * r_{i j}^{2}\right)$ (4)
where, $\beta_{0}$ is the maximum attractiveness?
Definition 4. The position update formula can be described as:
$x_{i}(t+1)=x_{i}(t)+\beta\left(x_{j}(t)x_{i}(t)\right)+\alpha \varepsilon_{i}$ (5)
where, α is the initial step length; $\varepsilon_{i}$ is a random factor obeying the Gaussian distribution.
After initializing the cluster centers, the traditional KMC can be optimized with the randomness and global search ability of the FA. The FAbased optimization simulates the KMC process as the mutual attraction between fireflies, classifies the remaining objects accurately and speeds up the global convergence of the KMC. However, there are two defects with this optimization approach:
(1) No algorithm is available to determine the value of k, the number of cluster centers. If improperly selected, the k value may severely affect the clustering accuracy and computing complexity.
(2) The optimal clustering results correspond to the extreme points of the objective function, i.e. the cluster centers are close to local minimum points. Therefore, the algorithm easily falls into the local optimum trap.
To overcome these defects, this paper puts forward a dynamic swarm FA based on chaos theory and uses it to improve the KMC. The proposed FA draws on the sensitivity and ergodicity of chaotic mapping to initial values [18].
3.1 Determination of k by the MM
Like the traditional KMC, the MM allocates the sample points to cluster centers based on the nearest neighbor principle according to the Euclidean distance. The difference between the two algorithms lies in the determination of k. Instead of directly giving the k value, the MM selects a random object $X_{i}$ from the sample points as the first cluster center, computes the Euclidean distances of the remaining objects to $X_{i}$ by formula (1), and takes the object the furthest away from $X_{i}$ as the new cluster center. These steps are repeated until no new cluster center emerges, yielding the k value.
The MM is implemented in the following steps:
Step 1. Input θ (0<θ<1) and select the initial cluster center $Z_{1}=x_{1}$.
Step 2. Generate a new cluster center.
(1) Compute the Euclidean distance $D_{i 1}$ between each point and $Z_1$, and take the $x_k$ corresponding to $D_{k 1}=\max \left\{D_{i 1}\right\}$ as the new cluster center $Z_{2}$;
(2) Compute the Euclidean distances D_{i1} and D_{i2} between each point and cluster centers $Z_{1}$ and $Z_{2}$. If $D_{k 1}=\max \left\{D_{i 1}\right\} D D_{l}=\max \left\{\min \left(D_{i 1}, D_{i 2}\right)\right\}, i=1,2, \ldots n$ and $D_{l}>\theta * D_{12}$, take $x_l$ as the third cluster center $Z_3$.
Note that $D_{12}$ is the distance between $Z_1$ and $Z_2$:
$D_{i 1}=\left\x_{i}Z_{1}\right\=\sqrt{\sum_{i=1}^{d}\leftx_{i}Z_{1}\right^{2}, D_{i 2}}=\left\x_{i}Z_{2}\right\$
(3) If $Z_{3}$ exists, judge if $D_{j}=\max \left\{\min \left(D_{i 1}, D_{i 2}, D_{i 3}\right)\right\}$, i=1,2,...n. If yes and $D_{l}>\theta * D_{12}$ determine the fourth cluster center. Repeat the above process until $D_{l} \leq \theta * D_{12}$.
Step 3. Count the total number of cluster centers k.
The clustering results of the MM rely heavily on the selection of parameters and the first cluster center. If the sample distribution is not known in advance, repeated tests are needed for this algorithm to achieve a good clustering effect. As a result, this paper only uses the MM to determine the value of k.
3.2 Optimization of cluster centers with chaos theory
The random, ergodic and regular chaotic variables were introduced to improve the FAoptimized KMC, whose cluster centers often fall near local minimum points, aiming to enhance the global search ability and avoid the local minimum trap.
The logistic chaotic map faces obvious uneven ergodicity when the r value falls between 0 and 1, which suppresses the algorithm efficiency. Shan Liang et al. proved that the tent map outperforms logistic map in convergence speed and ergodic evenness, because the chaotic sequence generated by the tent map is more helpful to algorithm optimization [19].
(1) Tent chaotic sequence
The tent map can be expressed as:
$x_{t+1}=\left\{\begin{array}{c}{2 x_{t}, 0 \leq x_{t} \leq \frac{1}{2}} \\ {2\left(1x_{t}\right), \frac{1}{2} \leq x_{t} \leq 1}\end{array}\right.$ (6)
Through dyadic transform, the tent map can be rewritten as:
$x_{t+1}=\left(2 x_{t}\right) \bmod 1$ (7)
The tent chaotic sequence is generated through the following steps:
Step 1. Randomly generate an initial value $x_0$ not in the range of (0.20,0.40,0.60,0.80) and denote it as $z, z(1)=x_{0}, i=j=$1.
Step 2. Iteratively generate a sequence x by formula (7).
Step 3. Implement Step 2 if x(i)=[0,0.25,0.5,0.75] or x(i)=x(ik), (k=[0,1,2,3,4]).
Step 4. Change the initial value for iteration by x(i)=z(j+1), replace j with j+1, and implement Step 2.
Step 5. If the maximum number of iterations has been reached, terminate the iteration and save the sequence x.
(2) Chaotic search
The proposed FCMM generates a tent chaotic sequence based on the bestknown local optimal solution, and jumps out of the local optimum trap through tent search, thereby converging to the global optimal solution.
Specifically, the distances $D_{i x}$ between all cluster centers $C_{i}(i=1,2, \dots k)$ and the current cluster center $C_x$ were ranked in descending order. Then, the smallest n classes (30% of all cluster centers) $C_{i 1}, C_{i 2}, \ldots, C_{i n}$ and $C_x$ were selected, and the maximum $X^{j} \max$ and minimum $X^{j} \min$ of the n+1 classes in the jth dimension were computed, forming a new chaotic search space. After that, a chaotic sequence was generated based on cluster center $X_x$ of $C_x$, and used for chaotic search. The optimal solution obtained through the search was taken as the new cluster center.
Assuming that $C_x$ is the cluster center, $X_{k}=\left\{x_{k 1}, x_{k 2}, \dots x_{k d}\right\}$ and $x_{k j} \in\left[X_{\min }, X_{\max }\right]$, then the main steps of tent chaotic search can be explained as:
Step 1. Map $X_{x}$ to (0,1) by $z_{k j}^{0}=\left(x_{k j}X_{\min }^{j}\right) /\left(X_{\max }^{j}X_{\min }^{j}\right)$, where $k=1,2, \ldots, n, j=1,2, \ldots, D$.
Step 2. Substitute the above formula into formula (7) for tent map, and iteratively generate the chaotic variable sequence $Z_{k j}^{m}\left(m=1,2, \dots, C_{\max }\right)$b, where, $C_{\max }$ is the maximum number of iterations for the chaotic search [11].
Step 3. Restore $z_{k j}^{m}$ to the neighborhood of the original solution space by formula (8), producing a new $V_k$:
$v_{k j}=x_{k j}+\left(X_{\max }^{j}X_{\min }^{j}\right) *\left(2 z_{k j}^{m}1\right) / 2$ (8)
Step 4. Compute the light intensity $F\left(v_{k}\right)$ of $v_{k}$, compare it with the local optimal light intensity $F\left(x_{k}\right)$, and save the better solution.
Step 5. If the number of iterations has reached $C_{\max }$, terminate the search; otherwise, return to Step 2.
3.3 Basic steps of the FCMM
Step 1. Initialize the parameters: the total number of objects N, the light intensity absorption coefficient γ, the step length α, the maximum number of iterations for chaotic search $C_{\max }$, the maximum light intensity I and the maximum attractiveness $β_0$.
Step 2. Determine the number of cluster centers k by the MM, and record the position of the initial cluster center obtained by the MM.
Step 3. Construct the chaotic search space based on the cluster centers by tent map.
Step 4. Update the position of the initial cluster center by tent map until no new cluster center emerges.
Step 5. Consider cluster centers as fireflies with the highest light intensity, compute the Euclidean distances between remaining objects and each cluster center, and assign different light intensities to these objects by formula (3).
Step 6. If $I_{i}>I_{j}$, then firefly j has a smaller objective function value than firefly i, i.e. j is in a superior position than i. In this case, i is attracted by and moves towards j. Determine the movement pattern by formula (4) and update the firefly positions by formula (5).
Step 7. Repeat Step 6 until all fireflies have been allocated to their respective cluster centers.
Step 8. Output the results.
Three experiments were carried out to verify the effectiveness of the FCMM. The first experiment compares the clustering effects of the FCMM with the KMC and the FA; the second experiment tests the clustering accuracy and convergence speed of the three algorithms on the UCI datasets; the third experiment contrasts the clustering results of the FCMM with the firefly partitional clustering algorithm based on adaptive step length (Algorithm 1) and the KMC improved by weighted Euclidean distance (Algorithm 2).
4.2 Results analysis
(1) The first experiment
There are 200 samples scattering across the solution space. The number of cluster centers k was determined as 4 by the MM. Since the FA parameter setting has a great impact on experimental results, this paper obtains the parameter combination that leads to the most frequent occurrences of the optimal solution through parallel tests. A series of combinations between the step length α and the light intensity absorption coefficient γ were enumerated, and compared based on 30 sets of test results. Finally, the optimal parameters were determined as the maximum attractiveness $β_0$=100, the light intensity absorption coefficient γ=1, the step length α=0.06, the maximum number of iterations $C_{\max }$ =50, the maximum light intensity I=100, and θ=0.4. Using these parameters, the clustering results of the three algorithms are obtained as Figure 1.
As shown in Figure 1, the KMC clearly fell into the local optimum trap, as some of its cluster centers concentrated in a small area. The FA’s cluster centers were more distributed evenly than those of the KMC, but still need to be improved. The FCMM achieved better clustering effect than the KMC and the FA, evidenced by the uniform distribution of the cluster centers and the avoidance of the local minimum trap.
(2) The second experiment
The KMC, FA and FCMM were tested independently on six datasets, using the same parameters as the first experiment. The mean clustering accuracy of each algorithm is given in Table 1, and the convergence curve of each algorithm is presented in Figure 2.
Table 1. The mean clustering accuracy of each algorithm (%)
Dataset 
KMC 
FA 
FCMM 
Iris 
87.93 
91.13 
92.16 
Wine 
56.85 
70.23 
72.15 
Seed 
86.97 
88.07 
90.46 
Glass 
54.05 
57.18 
63.12 
HayesRoth 
77.32 
81.06 
82.35 
Newthyroid 
72.34 
79.63 
80.28 
Figure 1. Clustering effects of the three algorithms
Figure 2. Convergence curves of the three algorithms on six datasets
Table 1 shows that the FCMM improved the mean clustering accuracy by 7.51 % and 2.2 % from the levels of the KMC and the FA, respectively, owing to the determination of the k value by the MM and the optimization of cluster centers by the chaotic theory. It can be seen from Figure 2 that the FCMM converged to the global optimal solution faster than the KMC and the FA, without sacrificing the clustering accuracy.
(3) The third experiment
This experiment mainly compares the clustering accuracy and runtime of the FCMM with those of Algorithm 1 and Algorithm 2.
Table 2 demonstrates that the FCMM achieved much higher clustering accuracy than the two contrastive algorithms, despite the relatively long runtime. The time complexity of the FCMM is attributable to the analysis on the initial cluster center by the MM.
Table 2. Comparison of results via the different algorithms
Data set 
Algorithm 1 
Algorithm 2 
FCMM 

Clustering accuracy (%) 
Runtime (s) 
Clustering accuracy (%) 
Runtime (s) 
Clustering accuracy (%) 
Runtime (s) 

Iris 
91.49 
8.342 
91.22 
9.234 
92.16 
8.34 
Wine 
70.43 
14.432 
70.71 
13.832 
72.15 
13.725 
Seed 
89.70 
9.232 
89.52 
8.342 
90.46 
8.282 
Glass 
60.70 
18.345 
62.45 
17.532 
63.12 
17.425 
HayesRoth 
81.46 
11.243 
81.28 
10.734 
82.35 
10.432 
Newthyroid 
79.98 
11.344 
79.01 
10.232 
80.28 
10.109 
The KMC may easily fall into the local optimum trap, if the initial cluster centers are not suitable. To solve the problem, this paper designs a novel algorithm, the FCMM, that optimizes the KMC in two aspects: the number of initial cluster centers, k, and the position of cluster centers. Our way to determine the k value can effectively reduce the impact of k on the clustering algorithm. In addition, the positions of cluster centers were updated through chaotic search, based on the FAoptimized KMC. The chaotic map weakens the impacts of initial clustering position on the clustering effect, and fully utilizes the global search ability and fast convergence of the FA, making it possible to avoid the local optimum trap and converge to global optimal solution rapidly. The clustering effect of the FCMM was compared with that of other clustering algorithms and tested on several UCI datasets. The results show that the FCMM can achieve fast convergence, accuracy clustering and avoid the local optimum trap, when it is applied to cluster a few amounts of data.
[1] Esfahani, R.K., Shahbazi, F., Akbarzadeh, M. (2019). Threephase classification of an uninterrupted traffic flow: A kmeans clustering study. Transportmetrica BTransport Dynamics, 7(1): 546558. https://doi.org/10.1080/21680566.2018.1447409
[2] Huynh, P., Yoo, M. (2016). VLCbased positioning system for an indoor environment using an image sensor and an accelerometer sensor. Sensors, 16(6): 783. https://doi.org/10.3390/s16060783
[3] Franti, P., Sieranoja, S. (2019). How much can kmeans be improved by using better initialization and repeats. Pattern Recognition, 93: 95112. https://doi.org/10.1016/j.patcog.2019.04.014
[4] Wu, J., Yu, Z.J., Zhuge, J.C., Xue, B. (2016). Indoor positioning by using scanning infrared laser and ultrasonic technology. Optics & Precision Engineering, 24(10): 24172423. https://doi.org/10.3788/OPE.20162410.2417
[5] Fadaei, A., Khasteh, S.H. (2019). Enhanced kmeans reclustering over dynamic networks. Expert Systems with Applications, 132: 126140. https://doi.org/10.1016/j.eswa.2019.04.061
[6] Yayan, U., Yucel, H., Yazici, A.G. (2016). A low cost ultrasonic based positioning system for the indoor navigation of mobile robots. Journal of Intelligent & Robotic Systems, 78(34): 541552. https://doi.org/10.1007/s1084601400607
[7] Bai, L., Liang, J.Y., Guo, Y.K. (2018). An ensemble clusterer of multiple fuzzy kmeans clusterings to recognize arbitrarily shaped clusters. IEEE Transactions on Fuzzy Systems, 26(6): 35243533. https://doi.org/10.1109/TFUZZ.2018.2835774
[8] Wang, T. (2012). Novel sensor location scheme using timeofarrival estimates. IET Signal Processing, 6(1): 813. https://doi.org/10.1049/ietspr.2010.0305
[9] Lithio, A., Maitra, R. (2018). An efficient kmeanstype algorithm for clustering datasets with incomplete records. Statistical Analysis and Data Mining, 11(6): 296311. https://doi.org/10.1002/sam.11392
[10] Jiang J.,Zheng X., Chen Y. (2013). A distributed RSSbased localization using a dynamic circle expanding mechanism. IEEE Sensors Journal, 13(10): 37543766. https://doi.org/10.1109/JSEN.2013.2258905
[11] Franti, P., Sieranoja, S. (2018). Kmeans properties on six clustering benchmark datasets. Applied Intelligence, 48(12): 47434759. https://doi.org/10.1007/s1048901812387
[12] Fang, S.H., Lin, T.N. (2010). A dynamic system approach for radio location fingerprinting in wireless local area networks. IEEE Transactions on Communications, 58(4): 10201025. https://doi.org/10.1109/tcomm.2010.04.090080
[13] Pan, X.Y., Chen, X.J., Li, A. (2017). Firefly partition clustering algorithm based on selfadaptive step. Application Research of Computers, 34(12): 35763579.
[14] Wang, C., Lei, X.J. (2014). New partition clustering algorithm of niching firefly. Computer Engineering, 40(5): 173177. https://doi.org/10.3969/j.issn.10003428.2014.05.036
[15] Chen, X.X., Wei, Y.Q., Ren, M. (2018). Weighted Kmeans clustering algorithm based on firefly algorithm. Application Research of Computers, 35(2): 466470.
[16] Garcia, J., Crawford, B., Soto, R. (2018). A kmeans binarization framework applied to multidimensional knapsack problem. Applied Intelligence, 48(2): 357380. https://doi.org/10.1007/s1048901709726
[17] Santhi, V., Jose, R. (2018). Performance analysis of parallel kmeans with optimization algorithms for clustering on spark. Distributed Computing and Internet Technology, 10722: 158162. https://doi.org/10.1007/9783319723440_12
[18] Daniel, R., Rao, K.N. (2018). EECFM: Energy efficient clustering based on firefly and midpoint algorithms in wireless sensor network. KSII Transactions on Internet and Information Systems, 12(8): 36833703.
[19] Shan, L., Qiang, H., Li, J. (2005). Chaotic optimization algorithm based on tent map. Control and Decision, 20(2): 179182.