A Machine Learning Approach to Outlier Removal for the Decision Tree Regression Method

ABSTRACT


INTRODUCTION
Outliers are defined as objects that are few in number and diverge from the majority of objects [1,2]. Outliers in data sets can occur due to systematic measurement errors and missing covariates [3]. In contrast to noise, defined as misclassification (class noise) or attribute error (attribute noise), outliers are a broader concept encompassing inconsistent data arising from natural population or process variation [4]. In other words, an outlier is a rare and unexpected occurrence that differs markedly from a regular occurrence [5].
However, ML algorithms can be used to overcome these limitations [22,24]. One of the popular ML methods for removing outliers is the Isolation Forest [25][26][27]. The Isolation Forest determines anomaly scores using an ensemble of special trees, the so-called isolation trees [28]. Its advantage is computational efficiency on high-dimensional data [29]. The algorithm is used for anomaly detection, is characterized by linear time complexity, and shows superior detection capability on perceptual data [30].
The objective of this research is to improve the prediction accuracy of the Decision Tree Regression (DTR) method by integrating outlier removal using the Isolation Forest method. DTR is a widely used ML method for prediction [28]. Like a classification tree, a DTR consists of a root, nodes, and leaves. DTR is a robust ML algorithm that offers outstanding advantages such as transparency, simplicity, and versatility in handling different data types. However, DTR also has limitations, such as the risk of overfitting [29,30]. Overfitting occurs when the training data are overly complex, noisy, or incomplete [30,31]. Overfitting due to outliers is often problematic in regression and classification models [32,33].
This paper consists of the following parts. The first section introduces outliers and methods for outlier detection; Isolation Forest is used as an alternative for outlier removal, and DTR is used as the regression method. Section 2 describes the methodology used: decision tree regression, Isolation Forest, and the proposed method. Section 3 then presents the experimental results and discussion. Finally, Section 4 presents the conclusions and future work.

RESEARCH METHODS
This research uses a modified DTR supervised learning approach that integrates the Isolation Forest method for outlier removal during data pre-processing. In addition, this research focuses on improving prediction accuracy in supervised learning.

Decision tree regression
Decision Tree (DT) is used for both classification and regression analyses [31,32]. This method is advantageous when dealing with decision-related problems [31]. DT works by repeatedly dividing the input data at each branch and creating a prediction at each part (node) based on the target value (output). This division results in a visual representation of a decision tree comprising branches and nodes. The topmost internal node is the tree's root, which has only outgoing edges, while the terminal nodes are called leaves. Each node within this tree framework executes a binary judgment that distinguishes one or more categories from the rest [33]. Decision Tree Regression (DTR) is based on the DT method and produces a prediction model structured as a tree [34]. Figure 1 illustrates the DT structure. In DTR, the characteristics of an item are analyzed, and a tree-shaped model is employed to make precise predictions for future data and the relevant continuous outputs [35][36][37][38][39].
Figure 1. The decision tree structure [31]

In a regression problem, let $X = x_1, x_2, \ldots, x_p$ be the predictor variables, where $p$ is the total number of predictors. Let $N$ be the number of observations and $Y = y_1, y_2, \ldots, y_N$ be a target variable that takes continuous values. Here $j$ is a feature variable and $t_h$ is a threshold value [36]. Let $m$ be a node and $\theta = (j, t_h)$ a candidate split.

Eq. (1) defines $Q_{\mathrm{left}}(\theta)$, the subset of the data assigned to the decision tree's left branch by a candidate split:

$$Q_{\mathrm{left}}(\theta) = \{(x, y) \mid x_j \le t_h\} \qquad (1)$$

Eq. (2) defines $Q_{\mathrm{right}}(\theta)$, the subset assigned to the right branch:

$$Q_{\mathrm{right}}(\theta) = \{(x, y) \mid x_j > t_h\} \qquad (2)$$

Let $\bar{y}_m$ denote the average of the predicted values at a terminal node $m$, and let $N_m$ be the number of samples present at that node. Eq. (3) gives the average estimated value at the terminal nodes:

$$\bar{y}_m = \frac{1}{N_m} \sum_{i \in Q_m} y_i \qquad (3)$$
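As a toy illustration of Eqs. (1)-(3), the following sketch splits six samples on a hypothetical feature threshold of 50 and computes the terminal-node averages; all values here are made up:

import numpy as np

# Toy data: one feature x_j and a continuous target y.
x_j = np.array([10, 35, 48, 52, 70, 90])
y = np.array([20, 40, 55, 60, 80, 95])

left = y[x_j <= 50]   # Q_left(theta), Eq. (1)
right = y[x_j > 50]   # Q_right(theta), Eq. (2)

# Terminal-node predictions are the node means, Eq. (3).
print(left.mean(), right.mean())  # 38.33... and 78.33...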

Isolation Forest
The Isolation Forest is a collection of binary trees, called isolation trees, designed to isolate data points [28]. The algorithm generates individual isolation trees that are merged into an ensemble method, the Isolation Forest. Tree creation depends on decisions determined by the format of the data set [40]. The Isolation Forest works well on large datasets because it has linear time complexity and low memory overhead [41]. The Isolation Forest technique is thus an unsupervised, ensemble-based approach to outlier detection in which an isolation score is calculated for every data point [42]. Briefly, the data distribution is split multiple times at random domain values, and the number of splits required to isolate each point is counted. Points that require fewer splits are more likely to be outliers. The outlier score is determined by the number of necessary splits, aggregated over numerous repetitions of this process [43]. To determine how distinct an instance is from other instances based on their respective path lengths, an outlier score is calculated as given by Eq. (4), while Eq. (5) is used to evaluate the average path length of the isolation trees [40,44,45].
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}} \qquad (4)$$

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n} \qquad (5)$$

where $s(x, n)$ is the anomaly score, $n$ is the dataset size, $E(h(x))$ is the average path length of the instance $x$ over a collection of trees, and $c(n)$ is the average path length of an unsuccessful search in a binary search tree given $n$ samples. $H(n-1)$ is a harmonic number and can be approximated by $\ln(n-1) + 0.5772156649$ (Euler's constant). Anomaly scores range from 0 to 1. Scores close to 1 indicate that an anomaly has been detected, scores well below 0.5 indicate normal data, and scores of around 0.5 for all instances indicate no obvious anomaly.
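As a worked illustration of Eqs. (4) and (5), the following sketch computes the anomaly score from a given average path length; the helper names c_n and anomaly_score are hypothetical:

import math

EULER_GAMMA = 0.5772156649

def c_n(n: int) -> float:
    # Eq. (5): average path length of an unsuccessful BST search on n samples,
    # with the harmonic number H(n-1) approximated by ln(n-1) + Euler's constant.
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length: float, n: int) -> float:
    # Eq. (4): scores near 1 flag anomalies; scores well below 0.5 look normal.
    return 2.0 ** (-avg_path_length / c_n(n))

# A point isolated after few splits in a sample of 256 looks anomalous.
print(anomaly_score(4.0, 256))   # ~0.76 -> likely outlier
print(anomaly_score(12.0, 256))  # ~0.44 -> normal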

Proposed method
The proposed method consists of several steps, which are explained below. The first step is data acquisition, in which the data required for analysis are collected. Data pre-processing then involves missing data imputation and feature selection.
The pre-processing applied is similar to that described by Van et al. [29]. To handle missing data values, the K-Nearest Neighbours Imputer technique (KNNImputer) is applied to fill in the empty values. The KNNImputer replaces missing values by evaluating target values from the nearest neighbors: each missing value is filled with an approximation computed as the average of the values of the $k$ nearest neighboring data points. The number of neighbors considered is determined by the n_neighbors parameter of the KNNImputer; this study uses n_neighbors = 3, as in previous research [29], meaning the algorithm considers the three nearest neighboring data points when filling in a missing value.
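A minimal sketch of this imputation step, assuming a pandas DataFrame with hypothetical column names (the file name is also an assumption, not given in the paper):

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("city_day.csv")  # assumed file name for Dataset 1
features = ["PM2.5", "PM10", "NO2", "CO", "SO2", "O3"]  # illustrative subset

# Fill each missing value with the average of its 3 nearest neighbors [29].
imputer = KNNImputer(n_neighbors=3)
df[features] = imputer.fit_transform(df[features])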
The second step is outlier detection: outliers are detected using Isolation Forest, the ML-based outlier detection method, and are then removed from the dataset in the third step. This process aims to clean the data of deviant values that can affect prediction accuracy. In the fourth step, the data are divided into two groups: training data and test data. The training data are used to train the method and analyze patterns in the data, while the test data are used to test the performance of the resulting model.
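A minimal sketch of the detection and removal steps, assuming the 0.1 threshold used in the experiments maps to scikit-learn's contamination parameter:

from sklearn.ensemble import IsolationForest

# Fit the Isolation Forest and keep only the inliers (label +1).
iso = IsolationForest(contamination=0.1, random_state=0)
labels = iso.fit_predict(df[features])
df_clean = df[labels == 1]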
The fifth step is to generate the DTR method. A decision tree regression model is constructed using the scikit-learn library with the following parameters: random_state = 0, max_depth = 6, max_leaf_nodes = 100. The training data, cleaned of outliers, are then used to train the DTR method. The method's performance is validated using K-fold cross-validation. Finally, the results are compared based on Mean Absolute Error (MAE), R-squared (R²), and Root Mean Square Error (RMSE) against previous research and other methods.
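Continuing the earlier sketches, a minimal sketch of this modeling step with the stated parameters; the target column name "AQI" and the five-fold choice are assumptions:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = df_clean[features], df_clean["AQI"]  # "AQI" column name assumed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # the 75:25 split described below

dtr = DecisionTreeRegressor(random_state=0, max_depth=6, max_leaf_nodes=100)
dtr.fit(X_train, y_train)

# K-fold cross-validation on the training data (K assumed to be 5).
print(cross_val_score(dtr, X_train, y_train, cv=5, scoring="r2").mean())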
This entire process forms a framework or methodology that can be used to analyze and improve the prediction results. Algorithm 1 presents the proposed algorithm of this research. The devised method is implemented on the datasets of [46]. Two datasets are utilized, namely the Air Quality Index (AQI) dataset provided by the Central Pollution Control Board (Dataset 1) and the Open Government Data India dataset (Dataset 2).
The first dataset presents 29,531 daily samples recording the average AQI from January 2015 to June 2020. There are 12 significant environmental pollutant variables, including PM10, PM2.5, Carbon Monoxide (CO), Ozone (O3), Nitrogen Dioxide (NO2), NOx, NO, Sulphur Dioxide (SO2), NH3, Benzene, Toluene, and Xylene. However, of these 12 variables, only the most relevant ones are selected through a feature selection stage to analyze AQI values.
Meanwhile, the Air Quality Index (AQI) Dataset 2 from Open Government Data contains 1,574 samples taken every hour in January 2020. This dataset is more focused, presenting AQI values and six other major pollutants, namely PM10, PM2.5, Ozone (O3), Nitrogen Dioxide (NO2), Sulphur Dioxide (SO2), and Carbon Monoxide (CO). These six variables are the focus of the second dataset for analysis and modeling related to air quality.
Feature selection for predicting the AQI value is performed by analyzing Pearson's Correlation Coefficient (PCC) between the target value and the 12 pollutant variables, as shown in Table 1. A variable is chosen as a feature for predicting the AQI value only if its correlation value is at least 0.45; therefore, the prediction analysis considers only variables with a significant relationship with the AQI value. The features selected in Dataset 1 are also used as features in Dataset 2 [29]. In the implementation, the dataset is split using a 75:25 ratio, where 75% of the data is designated for training and the remaining 25% is used for assessing the method's performance. The algorithmic method employed in this research is implemented using Python 3.11.6, along with Pandas 2.1.1, NumPy 1.26.1, and Scikit-learn 1.3.1.
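A minimal sketch of this correlation-based selection, continuing the earlier sketch and again assuming the target column is named "AQI":

# Pearson correlation of each pollutant with the AQI target.
corr = df[features + ["AQI"]].corr(method="pearson")["AQI"].drop("AQI")

# Keep only variables whose correlation with AQI is at least 0.45.
selected = corr[corr >= 0.45].index.tolist()
print(selected)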

Evaluation method
The MAE is a metric used to evaluate a regression method. It calculates the mean of the prediction errors over all instances to give the final score, assessing the deviation between the predicted value of each instance and the actual value [47,48]. It is simple to compute and less sensitive to outlying values [49]. Eq. (6) describes this metric as used in this research [47,50]:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}(x_i) \right| \qquad (6)$$

where $y_i$ is the ground-truth value for data point $x_i$, $\hat{y}(x_i)$ is the predicted value for $x_i$, and $N$ is the number of data points.

The R², or coefficient of determination, is a statistical measure quantifying the proportion of variance in the target explained by the model, ranging from 0 to 1. A value of 1 indicates a strong correlation between estimated and measured values [51]. R² is given by Eq. (7) [34]:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}(x_i) \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2} \qquad (7)$$

where $\bar{y}$ is the mean of the actual values.

RMSE is the root mean square error of the predictions versus the observations. RMSE is given by Eq. (8) [34]:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}(x_i) \right)^2} \qquad (8)$$
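Continuing the earlier sketch, these metrics can be computed for the held-out test data as follows:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = dtr.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)           # Eq. (6)
r2 = r2_score(y_test, y_pred)                       # Eq. (7)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # Eq. (8)
print(f"MAE={mae:.4f}  R2={r2:.4f}  RMSE={rmse:.4f}")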

RESULT AND DISCUSSION
This section explains the proposed method applied to the two datasets, starting with outlier detection and removal using Isolation Forest and three other common comparison methods: Minimum Covariance Determinant (MCD), Local Outlier Factor (LOF), and one-class SVM.
Table 2 shows the results of processing Dataset 1 and Dataset 2 using the proposed method with an Isolation Forest threshold of 0.1 and the three other methods. Dataset 1 has 16 attributes, of which seven were selected; it contains 29,531 instances, in which 2,953 outliers were detected. Meanwhile, Dataset 2 has 14 attributes, of which eight were selected; it contains 1,565 instances, in which 157 outliers were detected. Thus, the proposed method can detect outliers in both datasets using a subset of the existing attributes, and the number of outliers detected is proportional to the size of each dataset.
After outlier removal, the DTR is applied to perform regression on the training and testing data. Table 3 shows the results of analyzing the two datasets using the four methods mentioned previously. Three evaluation metrics are used to measure the accuracy of the methods: MAE, R², and RMSE. The DTR (Lightweight ML) row does not show training accuracy parameters because those values were not reported in the previous research [29].
Table 3 reveals that the proposed method achieved superior outcomes compared to previous research and the three other standard outlier methods.
The proposed method's training accuracy outperforms all other methods on MAE, R², and RMSE for both Datasets 1 and 2. The training and testing accuracy parameters show almost no difference for all methods, especially the proposed method, which means the models are not overfitting and do not yet need L1 or L2 regularization.
In the Dataset 1 testing results, the proposed method showed the best performance, with the lowest MAE of 21.7104 and the lowest RMSE of 33.0481, although its R² value of 0.8095 was not the highest among the methods.
The standard DTR method has the highest R² of 0.8943, but its MAE and RMSE are higher than those of the proposed method. The Local Outlier Factor-DTR, Minimum Covariance Determinant-DTR, and One-Class SVM-DTR methods have lower R² values than standard DTR, indicating that integrating these outlier detection techniques does not always yield improvements in the context of this dataset. Moreover, their MAE and RMSE values are higher than those of the proposed method.
For Dataset 2, the proposed method again shows superior performance, with a very low MAE of 1.679, a low RMSE of 4.6822, and a very high R² value of 0.9976, indicating that it is very effective on this dataset. Standard DTR also shows excellent results with an R² of 0.9964, but its MAE and RMSE are higher than those of the proposed method. The DTR variations integrating other outlier detection techniques show reduced performance compared to standard DTR, with slightly lower R² values and higher MAE and RMSE values.
The model performance analysis concludes that the proposed method consistently performs better on both datasets than standard DTR and its variations that use other outlier detection techniques, as presented in Table 3. This shows the importance of selecting and adapting the proper method for each dataset type to achieve optimal results. The result indicates that removing outliers before the learning procedure enhances conventional DTR's efficacy and resolves the dataset's outlier issue. Furthermore, the findings of this experimental analysis show that the proposed method predicts the values of the AQI datasets well, so this research improves on the results of the previous research [29].
Figure 2 shows a comparative graph contrasting the actual and predicted values for AQI Dataset 1. The X-axis indicates the sample index, whereas the Y-axis indicates the AQI value. The plot shows the predicted values as red plus signs and the actual values as a yellow line connected by dots.

CONCLUSIONS
The DTR method integrated with Isolation Forest outlier removal shows increased prediction accuracy in this research. Removing outliers before training helps improve the performance of conventional DTR. Based on the discussion, outliers can affect the accuracy of regression performance. Therefore, detecting and handling outliers is essential to ensure accurate analysis results and effective machine-learning methods.
Future research could explore alternative outlier removal techniques, such as DBSCAN, DBSCAN++, k-means, k-means++, and other methods. The DTR could also be modified using XGBoost, which is renowned for its strong performance, favorable time complexity, and low memory requirements. Researchers must carefully consider the characteristics of the data set and the advantages and limitations of each method before choosing the most appropriate approach for detecting and handling outliers.

Figure 2. Actual and predicted values of AQI Dataset 1 with Isolation Forest outlier removal

Table 1. Feature selection

Table 2. Regression dataset and outliers observation

Table 3. The model performance result