Outlier detection is an important research direction in data mining. This paper applies outlier detection to intrusion detection, and proposes a novel outlier detection algorithm based on neighbor propagation clustering. The proposed algorithm first clusters the original dataset, then calculates the outlier degree of each data point, and finally mines the intrusion behaviors. To verify its effectiveness, the algorithm was tested on public network-intrusion datasets. The results prove that our algorithm can detect intrusions with high accuracy.
Keywords: intrusion detection, outlier detection, data mining, clustering, neighbor propagation
In recent years, network intrusions have been constantly emerging, greatly affecting normal access to the Internet. It is urgently needed to detect network intrusions and curb these attacks. The essence of intrusion detection is to quickly detect intrusion behaviors in a large amount of data, and to provide a response mechanism and solutions to these intrusions [1]. To date, many different techniques have been introduced to improve the detection of network intrusions.
Outlier detection, a.k.a. outlier data mining and anomaly detection [2], is a research direction in data mining. Outlier detection aims to detect, through appropriate methods, the few data points that differ from the rest of a large dataset. The possible bases of outlier detection algorithms include statistics, distance, density, deviation, and clustering [3, 4].
The statistics-based algorithm [5], the earliest outlier detection method, assumes that the data obey a specific type of distribution (e.g., Poisson or Gaussian), verifies this assumption, and identifies the data points falling outside the distribution as outliers.
The distance-based algorithm [6] applies to a wider scope than the statistics-based algorithm, because it does not need to know the data distribution in advance. However, the distance-based algorithm performs poorly in detecting local outliers, for the distance is computed globally.
The density-based algorithm [7] extends the distance-based algorithm, and additionally takes local outliers into account.
The deviation-based algorithm [8] first divides the dataset into data blocks according to prior conditions. The data points with high outlier degree and their neighbors are allocated to the same data block. Then, the outliers in the data blocks are identified, and the outlier degree is characterized by the discrete coefficient.
The clustering-based algorithm [9] treats any data point in the clustering results that does not belong to any class as an outlier. The outliers and outlier clusters are found by the clustering algorithm. As a result, clustering-based algorithms often outperform other outlier detection approaches.
To sum up, outliers can be detected by many algorithms, but most cannot achieve the ideal detection effect. The clustering-based algorithm is relatively accurate, thanks to the inherent advantages of clustering.
In light of the above analysis, this paper introduces neighbor propagation into outlier detection, and designs an outlier detection algorithm based on neighbor propagation clustering.
Intrusion detection is an attack identification technique for networks, computers, and other systems. Using a monitoring program, Heberlein et al. [10] monitored network data and created a monitoring dataset, which serves as the data source for intrusion detection; their research marks the formal establishment of network intrusion detection as a research direction. Daneshpazhouh et al. [11] successfully applied clustering analysis to an intrusion detection model, highlighting the value of unsupervised learning in intrusion detection.
The outlier detection method, an important tool of data mining and unsupervised learning, has gradually been introduced to intrusion detection. Shaikh et al. [12] carried out outlier detection on a dataset assumed to obey a Gaussian distribution. Östermark [13] put forward a k-nearest neighbors (kNN) algorithm applicable to distance-based outlier detection. Ghoting et al. [14] presented the recursive binning and re-projection (RBRP) algorithm, making distance-based outlier detection suitable for higher-dimensional datasets. Based on the concept of connectivity, Pai et al. [15] proposed a new algorithm that can accurately detect outliers in categorical data. Sun et al. [16] developed a deviation-based outlier detection algorithm, which outshines general outlier detection algorithms in local outlier detection. Liu et al. [17] designed a clustering-based outlier detection algorithm: the original data are split into k clusters, and the outlier points are pinpointed within these clusters. Gan et al. [18] combined k-means clustering (KMC) and outlier mining into a two-stage outlier detection algorithm; its relatively good detection effect is directly affected by the parameter setting of the KMC.
In intrusion detection, an attack is a behavior distinct from most data, similar to an outlier in a dataset. This similarity provides the basis for applying outlier detection algorithms in intrusion detection. So far, several outlier detection algorithms have been introduced to intrusion detection, including the frequent pattern-based algorithm [19], the density-based algorithm [20], and the depth-based algorithm [21]. Previous studies have confirmed that both outlier detection and intrusion detection focus on the following aspects: data processing, model construction, method application, and technology selection. In the real world, however, the application is often inaccurate and inefficient. To solve the problem, this paper improves a clustering algorithm for outlier detection, and applies it to optimize the effect of intrusion detection.
3.1 Neighbor propagation
Based on neighbor information [22], neighbor propagation aims to maximize the similarity between each data point and its nearest cluster head through information transmission. Without presetting the number of clusters, the algorithm can achieve a good clustering effect.
Neighbor propagation is a similarity-based clustering algorithm. It bears great resemblance to the distance-based KMC, the density-based spatial clustering of applications with noise (DBSCAN), and the similarity-based spectral clustering.
Since distance is the basis of both similarity and density, neighbor propagation can be compared with the said three algorithms. For example, in the classical KMC, the manually set number of clusters might be unsuitable due to the lack of empirical knowledge; by contrast, neighbor propagation regards each sample as a candidate cluster head, sets up a similarity matrix, and finds suitable cluster heads through the mutual propagation of responsibility and availability between data points.
The main concepts of neighbor propagation are defined below:
Similarity: This fuzzy concept reflects how similar two data points are. Before being applied in scientific research, the concept must be quantified; similarity is generally depicted by distance.
Distance: This quantifiable concept refers to the interval length between data points. In scientific research, the concept is often used to measure similarity and cluster data. The common forms of distance include Euclidean distance, Manhattan distance, Chebyshev distance and Hamming distance. The distance is negatively correlated with the similarity between data points.
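To make these measures concrete, the four common distances above can be sketched in Python (the function names are ours, for illustration only):

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))
```

For the same pair of points, these measures generally give different values; which one best reflects similarity depends on the data.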
Similarity matrix S_{m}: The similarity matrix S_{m} stores the similarities S(i, k), where S(i, k) reflects how suitable data point x_{k} is as the cluster head of data point x_{i}. The matrix S_{m} can be symmetric or asymmetric, which expands the applicable scope of neighbor propagation. The similarity S(i, k) is usually measured by distance. For simplicity, similarity was measured here by the negative squared Euclidean distance:
$S(i, k)=-\left\|x_{k}-x_{i}\right\|^{2}$ (1)
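Formula (1), together with the preference initialization described below, can be sketched as a minimal NumPy illustration (`similarity_matrix` is our name, not part of the algorithm's specification):

```python
import numpy as np

def similarity_matrix(X, preference=None):
    """Formula (1): S(i, k) = -||x_k - x_i||^2, the negative squared
    Euclidean distance, so that closer points are more similar."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]   # pairwise coordinate differences
    S = -np.sum(diff ** 2, axis=2)         # negative squared distances
    if preference is None:
        # Default preference p: the median of the off-diagonal similarities.
        preference = np.median(S[~np.eye(len(X), dtype=bool)])
    np.fill_diagonal(S, preference)
    return S
```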
Probability: The S(m, m) on the diagonal of matrix S_{m} reflects the probability of data point m being a cluster head; the larger S(m, m), the higher that probability. When neighbor propagation is initialized, all data points have the same probability. In the absence of prior knowledge, the median p of all similarities is generally taken as the initial value.
As shown in Figures 1 and 2, responsibility and availability are two kinds of information transmitted between data points. The competition between data points is realized through the propagations of responsibility and availability.
Responsibility r(i, j) is the information generated by data point x_{i} and candidate cluster head x_{j} during the updating process. The information is sent to data point x_{j} to judge whether x_{j} is suitable as the cluster head of x_{i}.
Availability a(i, j) is the information generated by candidate cluster head x_{j} and data point x_{i} in the updating process. The information is sent to data point x_{i} to judge whether x_{j} is suitable as the cluster head of x_{i}.
This research only considers the positive support of the other data points for x_{j} to serve as the cluster head. The greater the values of r(i, j) and a(i, j), the higher the probability for x_{j} to become the cluster head, and the more likely that x_{i} belongs to the same class as x_{j}.
Figure 1. Propagation of responsibility
Figure 2. Propagation of availability
In neighbor propagation, the responsibility and availability of data points are updated iteratively. Once the cluster heads converge to a stable set, the other data points are assigned in turn to the clusters of these heads. The responsibility matrix R=[r(i,j)] can be updated by:
$r(i, j) \leftarrow s(i, j)-\max _{j^{\prime} \neq j}\left\{a\left(i, j^{\prime}\right)+s\left(i, j^{\prime}\right)\right\}$ (2)
$r(j, j) \leftarrow p(j)-\max _{i \neq j}\{a(j, i)+s(j, i)\}$ (3)
The availability matrix $A=[a(i, j)]$ can be updated by:
$a(i, j) \leftarrow \min \left\{0, r(j, j)+\sum_{i^{\prime} \notin\{i, j\}} \max \left(0, r\left(i^{\prime}, j\right)\right)\right\}$ (4)
$a(j, j) \leftarrow \sum_{i^{\prime} \neq j} \max \left(0, r\left(i^{\prime}, j\right)\right)$ (5)
Then, a damping factor dam is introduced to curb the oscillation arising from the unstable number of cluster heads across iterations. In each iteration, the current responsibility R_{i} and availability A_{i} are adjusted according to the previous R_{i-1} and A_{i-1} by:
$R_{i}=(1-d a m) \times R_{i}+d a m \times R_{i-1}$ (6)
$A_{i}=(1-d a m) \times A_{i}+d a m \times A_{i-1}$ (7)
From formulas (6) and (7), it can be seen that the $R_{i}$ and $A_{i}$ obtained in each iteration are affected by the damping factor dam. If dam is relatively large, R_{i} and A_{i} are strongly influenced by R_{i-1} and A_{i-1}. In this case, the number of clusters changes only slightly from that of the previous iteration.
The clustering of neighbor propagation outputs a set of clusters C_{ω}, ω=1,2,…,c, through iterative processing of the dataset x_{i}, i=1,2,…,n. The entire clustering process can be divided into the following steps:
Step 1. Initialize the responsibility matrix R and the availability matrix A as zero matrices, set up the similarity matrix S_{m}, and use the median of S_{m} as the initial probability P of a data point being a cluster head.
Step 2. Update the responsibility matrix R.
Step 3. Update the availability matrix A.
Step 4. Repeat Steps 2 and 3 until one of two conditions is met: the maximum number of iterations I_{max} is reached, or the cluster heads remain stable for T_{s} consecutive iterations.
Step 5. If r(j,j)+a(j,j)>0, take data point j as the cluster head.
Step 6. Assign the other data points to the corresponding clusters, and terminate the clustering process.
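Steps 1-6 can be sketched end to end as follows. This is a simplified, unoptimized Python illustration of the update rules (2)-(7) under the median-preference initialization; all names are ours, and a production implementation would vectorize the inner loops:

```python
import numpy as np

def neighbor_propagation(X, dam=0.5, max_iter=200, stable_iter=15):
    """Sketch of clustering Steps 1-6. Returns (cluster head indices,
    per-point head assignment)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Step 1: similarity matrix (1), with the median preference on the diagonal.
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(S, np.median(S[~np.eye(n, dtype=bool)]))
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    last, stable = None, 0
    for _ in range(max_iter):
        # Step 2: responsibilities, formulas (2)-(3); S[j, j] holds p(j).
        AS = A + S
        R_new = np.empty_like(R)
        for i in range(n):
            for j in range(n):
                R_new[i, j] = S[i, j] - np.delete(AS[i], j).max()
        # Step 3: availabilities, formulas (4)-(5).
        A_new = np.empty_like(A)
        for j in range(n):
            pos = np.maximum(0, R_new[:, j])
            for i in range(n):
                if i == j:
                    A_new[j, j] = pos.sum() - pos[j]
                else:
                    A_new[i, j] = min(0.0, R_new[j, j] + pos.sum() - pos[i] - pos[j])
        # Damping, formulas (6)-(7).
        R = (1 - dam) * R_new + dam * R
        A = (1 - dam) * A_new + dam * A
        # Steps 4-5: heads satisfy r(j,j) + a(j,j) > 0; stop once the head
        # set stays unchanged for stable_iter consecutive iterations.
        heads = np.flatnonzero(np.diag(R) + np.diag(A) > 0)
        stable = stable + 1 if len(heads) and np.array_equal(heads, last) else 0
        last = heads
        if stable >= stable_iter:
            break
    # Step 6: assign every point to its most similar cluster head.
    if len(heads) == 0:
        return heads, np.full(n, -1)
    labels = heads[np.argmax(S[:, heads], axis=1)]
    labels[heads] = heads  # a head belongs to its own cluster
    return heads, labels
```

On two well-separated groups of points, the sketch elects one head per group and assigns the remaining points accordingly.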
The clustering quality directly hinges on the P value. The greater the P value, the more data points can serve as cluster heads, and the more clusters are formed.
3.2 Outlier detection algorithm based on neighbor propagation clustering
The outlier degree is a relative concept. The data points that differ from most others are generally treated as outliers. In a clustering-based outlier detection algorithm, the data points that do not belong to any cluster, or belong to a cluster deviating from most clusters, are considered outliers. Such a cluster is commonly referred to as an outlier cluster.
Clustering is a suitable way to mine outliers from the original dataset. The clusters offer an intuitive picture of the outliers, making it easy to identify the causes of outlier behavior. Based on the clustering result, a dataset consists of two parts: clusters and outliers. The former represents the main body of the dataset, and the latter is an auxiliary part.
After clustering, the outliers and outlier clusters must be mined by a suitable strategy. The distance-based kNN algorithm, though widely used, is not very effective in outlier mining. To solve the problem, this paper introduces the concept of cluster partition, that is, dividing the clusters into large and small clusters.
Overall, a large cluster contains a large amount of normal data and a few outliers that deviate from the cluster head, while a small cluster is very likely to be an outlier cluster, due to its limited number of data points. The cluster partition is defined as follows:
For dataset $D$, the results of neighbor propagation clustering can be expressed as $C=\left\{C_{1}, C_{2}, C_{3}, \cdots, C_{k}\right\}$, where $C_{i} \cap C_{j}=\emptyset$ for $i \neq j$, $C_{1} \cup C_{2} \cup \cdots \cup C_{k}=D$, and $\left|C_{1}\right| \geq\left|C_{2}\right| \geq \cdots \geq\left|C_{k}\right|$. It is assumed that the first $d$ clusters are large clusters $C_{\mathrm{L}}=\left\{C_{i} \mid i \leq d\right\}$, and the rest are small clusters $C_{\mathrm{S}}=\left\{C_{j} \mid j>d\right\}$. Then, the large clusters should cover at least a proportion $\alpha$ of all data:
$\left(\left|C_{1}\right|+\left|C_{2}\right|+\cdots+\left|C_{d}\right|\right) \geq|D| \times \alpha$ (8)
Since the proportion of outliers is generally 10%-20%, α is generally set to 0.8-0.9. In addition, the data volume of the smallest large cluster should be at least β times that of the largest small cluster:
$\frac{\left|C_{d}\right|}{\left|C_{d+1}\right|} \geq \beta$ (9)
The value of β is generally set to 2-4, depending on the specific dataset.
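Under conditions (8) and (9), the cluster partition can be sketched as follows (a hypothetical helper; clusters are represented as plain lists of points):

```python
def partition_clusters(clusters, alpha=0.85, beta=3.0):
    """Split clusters into large and small ones: take clusters in
    descending size order until they cover at least alpha of the data
    (condition (8)) and the last large cluster is at least beta times
    the size of the next cluster (condition (9))."""
    clusters = sorted(clusters, key=len, reverse=True)
    total = sum(len(c) for c in clusters)
    covered = 0
    for d, c in enumerate(clusters, start=1):
        covered += len(c)
        nxt = len(clusters[d]) if d < len(clusters) else 0
        if covered >= alpha * total and (nxt == 0 or len(c) >= beta * nxt):
            return clusters[:d], clusters[d:]
    # No boundary satisfied both conditions: treat everything as large.
    return clusters, []
```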
For any data point t in dataset D, suppose dist(t,C_{i}) is the Euclidean distance between the data point and the cluster head of C_{i}. Then, the outlier degree C_{o}(t) of data point t in a small cluster and a large cluster can be respectively calculated by:
$C_{o}(t)=\left\{\begin{array}{ll}\frac{1}{\left|C_{j}\right|} \times \min _{i=1,2, \cdots, d} \operatorname{dist}\left(t, C_{i}\right), & t \in C_{j}, C_{j} \in C_{\mathrm{S}}, C_{i} \in C_{\mathrm{L}} \\ \frac{1}{\left|C_{i}\right|} \times \operatorname{dist}\left(t, C_{i}\right), & t \in C_{i}, C_{i} \in C_{\mathrm{L}}\end{array}\right.$ (10)
The first line of formula (10) indicates the relationship between the number of data points in a small cluster and the distance between each point t and the head of the nearest large cluster. If most data points in a small cluster are found to be outliers, the small cluster will be identified as an outlier cluster.
The second line of formula (10) indicates the relationship between the number of data points in a large cluster and the distance between each point t and the cluster head.
In this way, the outliers that deviate from stable large clusters can be determined easily. Following the above definition of outlier degree, the outliers can be obtained by looking for the maximum outlier degree, even if two classes have fuzzy boundaries or if the data points of two classes are mistakenly clustered into one class.
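Assuming the normalization in formula (10) uses the size of the cluster containing t (as the surrounding text suggests), the outlier degree can be sketched as:

```python
import math

def outlier_degree(t, own_cluster, large_clusters, large_heads):
    """Formula (10): dist(t, C_i) is the Euclidean distance from t to
    the head of large cluster C_i; the degree is scaled by the size of
    the cluster containing t."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if own_cluster in large_clusters:
        # Second line of (10): distance to the point's own cluster head.
        head = large_heads[large_clusters.index(own_cluster)]
        return dist(t, head) / len(own_cluster)
    # First line of (10): a small-cluster point is scored by its distance
    # to the nearest large-cluster head.
    return min(dist(t, h) for h in large_heads) / len(own_cluster)
```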
The neighbor propagation clustering was combined with the outlier degree into the proposed outlier detection algorithm. The algorithm outputs the top-n outliers based on the original dataset D and the parameters α and β. As shown in Figure 3, the proposed algorithm works in the following steps:
Figure 3. The flow chart of proposed algorithm
Step 1. Set up the similarity matrix $S_{m}=\left(\omega_{i j}\right)_{n \times n}$ for dataset D, where $\omega_{i j}=-\left\|x_{i}-x_{j}\right\|^{2}$.
Step 2. Iteratively update the responsibility matrix R and the availability matrix A.
Step 3. Obtain the l+s cluster heads that satisfy r(j,j)+a(j,j)>0, and assign the other data points to the corresponding clusters.
Step 4. Partition the clusters into l large clusters C_{L} and s small clusters C_{S}, according to the input parameters α and β and the definitions of large and small clusters.
Step 5. Calculate the outlier degree C_{o}(t) of each data point.
Step 6. Sort the C_{o}(t) values of all data points in descending order, and take the first n points as outliers.
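Step 6 reduces to a simple ranking; a sketch (the score dictionary is a hypothetical input format, mapping point identifiers to their outlier degrees):

```python
def top_n_outliers(degrees, n):
    """Sort points by outlier degree C_o(t) in descending order and
    return the identifiers of the first n points."""
    ranked = sorted(degrees, key=degrees.get, reverse=True)
    return ranked[:n]
```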
To verify its feasibility, the proposed algorithm was tested on a simple two-dimensional (2D) artificial dataset containing local outliers and outlier clusters. In addition, our algorithm was compared with the density-based local outlier factor (LOF) algorithm and the KMC algorithm in the detection of outliers and outlier clusters.
The information of the artificial dataset is listed in Table 1. The data points in the dataset are distributed on a 2D plane, forming four large clusters in the corners and a five-point outlier cluster in the middle. In addition, some outlier points are scattered around the large clusters.
The outlier detection results of our algorithm were compared with those of the LOF and the KMC. The comparison shows that the LOF could not identify the outliers in the multi-class data. The KMC identified the outlier clusters, but did not detect some outliers; a possible reason is that the KMC relies only on distance as the measure and easily falls into the local optimum trap, failing to accurately find some outliers. By contrast, our algorithm detected most outliers and outlier clusters, and obviously outperformed the other two algorithms. The experiment on the artificial dataset preliminarily proves the feasibility of our algorithm.
Next, the feasibility and accuracy of our algorithm were further verified on real datasets from the UCI Machine Learning Repository. The information on the selected datasets is given in Table 2.
The reliability of an outlier detection algorithm is usually evaluated by two metrics: accuracy and false positive rate. However, the outliers detected by our algorithm are the data points in the top places of a ranking, so it is not convincing to evaluate the detection effect only on the top-n outliers. Instead, the evaluation of the ordered results can be transformed into the evaluation of the ranking of search results: the higher the detected outliers rank by outlier degree, the more effective the algorithm. In this way, the effectiveness of an outlier detection algorithm can be evaluated more accurately.
Table 1. Information of the artificial dataset

Dataset              Data points   Attributes   Clusters   Outliers   Outlier clusters
Artificial dataset   230           2            4          25         1
Table 2. Information of UCI datasets

Dataset         Samples   Attributes   Clusters   Outlier points
Iris            152       3            3          12
Wine            176       12           2          16
Seeds           215       8            3          22
Breast Cancer   698       10           4          36
Then, the mean average precision (MAP) [23] can be adopted to measure the effect of outlier detection. The MAP is the mean precision over all evaluated queries, a single-value index of algorithm performance on all data points. The higher the rank of the detected targets, the higher the MAP.
Before calculating the MAP, it is necessary to compute the average precision (AP) of each query. The AP is obtained by accumulating the precision at each position in a query, and normalizing the accumulated result:
$A P=\frac{1}{1-0} \int_{0}^{1} p(r) d r=\int_{0}^{1} p(r) d r$ (11)
where, p(r) is the precision p at position r in a query. After that, the APs of N queries are accumulated and averaged into the MAP:
$M A P=\frac{1}{N} \sum_{i=1}^{N} A P_{i}$ (12)
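Over ranked result lists, the discrete counterparts of formulas (11) and (12) commonly used in information retrieval can be sketched as (function names are ours):

```python
def average_precision(ranked, relevant):
    """AP of one ranked list: average the precision measured at each
    position that holds a relevant item."""
    hits, total = 0, 0.0
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / pos      # precision at this position
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """Formula (12): the mean of the per-query APs.
    results: list of (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)
```

The higher the relevant items (here, true outliers) sit in the ranking, the larger the AP of that query, and hence the larger the MAP.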
The accuracies and MAPs of our algorithm, the LOF and the KMC are compared in Tables 3 and 4, respectively.
Table 3. Accuracy comparison

Algorithm       Iris   Wine   Seeds   Breast Cancer
LOF             65%    27%    23%     28%
KMC             45%    29%    28%     49%
Our algorithm   80%    45%    47%     55%
Table 4. MAP comparison

Algorithm       Iris   Wine   Seeds   Breast Cancer
LOF             66%    35%    27%     45%
KMC             52%    36%    39%     53%
Our algorithm   86%    58%    59%     62%
As shown in Tables 3 and 4, all three algorithms detected most outliers of the Iris dataset, and our algorithm outperformed the other two.
For Wine dataset, our algorithm found outliers and outlier clusters much more accurately than LOF and KMC.
For the Seeds dataset, the three algorithms each detected about 25-45 data points, but varied in the detection accuracy on the outliers with higher outlier degrees.
For Breast Cancer dataset, our algorithm achieved the best detection result, followed in turn by KMC and LOF. The relatively good results of our algorithm and KMC are attributable to the fact that some highly similar outliers form outlier clusters.
The above results show that our algorithm is highly effective in outlier detection, despite failing to identify 15-30 outliers. On all four datasets, our algorithm outperformed the two contrastive methods.
Most outlier detection algorithms cannot effectively detect outliers without complex parameter settings. To solve the problem, this paper designs an outlier detection algorithm based on neighbor propagation clustering with outlier degree. Experimental results show that our algorithm identifies the outliers in the artificial dataset and the UCI datasets more effectively and accurately than the LOF and the KMC. Of course, our algorithm still has a high complexity, and a relatively low efficiency on large datasets. To improve its efficiency, future research will try to apply the algorithm in a distributed environment.
[1] Raja, S., Jaiganesh, M., Ramaiah, S. (2017). An efficient fuzzy self-classifying clustering based framework for cloud security. International Journal of Computational Intelligence Systems, 10(1): 495-506. https://doi.org/10.2991/ijcis.2017.10.1.34
[2] Tian, L., Fan, Y., Li, L., Mousseau, N. (2020). Identifying flow defects in amorphous alloys using machine learning outlier detection methods. Scripta Materialia, 186: 185-189. https://doi.org/10.1016/j.scriptamat.2020.05.038
[3] Sun, G., Bin, S., Jiang, M., Cao, N., Zheng, Z., Zhao, H., Xu, L. (2019). Research on public opinion propagation model in social network based on blockchain. CMC-Computers, Materials & Continua, 60(3): 1015-1027.
[4] Yuan, Z., Zhang, X., Feng, S. (2018). Hybrid data-driven outlier detection based on neighborhood information entropy and its developmental measures. Expert Systems with Applications, 112: 243-257. https://doi.org/10.1016/j.eswa.2018.06.013
[5] Zhang, Y., Hamm, N.A., Meratnia, N., Stein, A., Van De Voort, M., Havinga, P.J. (2012). Statistics-based outlier detection for wireless sensor networks. International Journal of Geographical Information Science, 26(8): 1373-1392. https://doi.org/10.1080/13658816.2012.654493
[6] Ahn, J., Lee, M.H., Lee, J.A. (2019). Distance-based outlier detection for high dimension, low sample size data. Journal of Applied Statistics, 46(1): 13-29. https://doi.org/10.1080/02664763.2018.1452901
[7] Bai, M., Wang, X., Xin, J., Wang, G. (2016). An efficient algorithm for distributed density-based outlier detection on big data. Neurocomputing, 181: 19-28. https://doi.org/10.1016/j.neucom.2015.05.135
[8] Zhao, X., Zhang, J., Qin, X. (2017). LOMA: A local outlier mining algorithm based on attribute relevance analysis. Expert Systems with Applications, 84: 272-280. https://doi.org/10.1016/j.eswa.2017.05.009
[9] Jiang, F., Liu, G., Du, J., Sui, Y. (2016). Initialization of K-modes clustering using outlier detection techniques. Information Sciences, 332: 167-183. https://doi.org/10.1016/j.ins.2015.11.005
[10] Heberlein, L.T., Dias, G.V., Levitt, K.N., Mukherjee, B., Wood, J., Wolber, D. (1989). A network security monitor (No. UCRL-CR-105095). Lawrence Livermore National Lab., CA (USA); California Univ., Davis, CA (USA), Dept. of Electrical Engineering and Computer Science. https://doi.org/10.2172/6223037
[11] Daneshpazhouh, A., Sami, A. (2014). Entropy-based outlier detection using semi-supervised approach with few positive examples. Pattern Recognition Letters, 49: 77-84. https://doi.org/10.1016/j.patrec.2014.06.012
[12] Shaikh, S.A., Kitagawa, H. (2014). Efficient distance-based outlier detection on uncertain datasets of Gaussian distribution. World Wide Web, 17(4): 511-538. https://doi.org/10.1007/s11280-013-0211-y
[13] Östermark, R. (2009). A fuzzy vector valued KNN-algorithm for automatic outlier detection. Applied Soft Computing, 9(4): 1263-1272. https://doi.org/10.1016/j.asoc.2009.03.009
[14] Ghoting, A., Parthasarathy, S., Otey, M.E. (2008). Fast mining of distance-based outliers in high-dimensional datasets. Data Mining and Knowledge Discovery, 16(3): 349-364. https://doi.org/10.1007/s10618-008-0093-2
[15] Pai, H.T., Wu, F., Hsueh, P.Y.S.S. (2014). A relative patterns discovery for enhancing outlier detection in categorical data. Decision Support Systems, 67: 90-99. https://doi.org/10.1016/j.dss.2014.08.006
[16] Sun, G., Bin, S. (2018). A new opinion leaders detecting algorithm in multi-relationship online social networks. Multimedia Tools and Applications, 77(4): 4295-4307. https://doi.org/10.1007/s11042-017-4766-y
[17] Liu, J., Deng, H. (2013). Outlier detection on uncertain data based on local information. Knowledge-Based Systems, 51: 60-71. https://doi.org/10.1016/j.knosys.2013.07.005
[18] Gan, G., Ng, M.K.P. (2017). K-means clustering with outlier removal. Pattern Recognition Letters, 90: 8-14. https://doi.org/10.1016/j.patrec.2017.03.008
[19] Said, A.M., Dominic, P.D.D., Faye, I. (2015). Data stream outlier detection approach based on frequent pattern mining technique. International Journal of Business Information Systems, 20(1): 55-70. https://doi.org/10.1504/IJBIS.2015.070892
[20] Zhang, Z., Zhu, M., Qiu, J., Liu, C., Zhang, D., Qi, J. (2019). Outlier detection based on cluster outlier factor and mutual density. International Journal of Intelligent Information and Database Systems, 12(12): 91-108. https://doi.org/10.1504/IJIIDS.2019.102329
[21] Jiang, F., Sui, Y., Cao, C. (2011). A hybrid approach to outlier detection based on boundary region. Pattern Recognition Letters, 32(14): 1860-1870. https://doi.org/10.1016/j.patrec.2011.07.002
[22] Bin, S., Sun, G., Cao, N., Qiu, J., Zheng, Z., Yang, G., Xu, L. (2019). Collaborative filtering recommendation algorithm based on multi-relationship social network. CMC-Computers, Materials & Continua, 60(2): 659-674. https://doi.org/10.32604/cmc.2019.05858
[23] Kurland, O., Lee, L. (2009). Clusters, language models, and ad hoc information retrieval. ACM Transactions on Information Systems (TOIS), 27(3): 1-39. https://doi.org/10.1145/1508850.1508851