Unifying Variable Importance Scores from Different Machine Learning Models Using Simulated Annealing

ABSTRACT


INTRODUCTION
Whatever the model, whether statistical or machine learning, researchers must be able to explain or interpret its output so that problems can be solved and goals achieved. Building a model includes predicting outcomes, identifying significant predictors, and assessing the model's accuracy [1].
Identifying the predictors that matter for the response variable is usually one of the purposes of building a model. It makes the model simpler, because predictors that cannot explain the response variable need no further attention or, in other words, can be excluded from the model. If a linear model is used in a study, the importance of a variable can be obtained from the value of its coefficient. A coefficient that differs significantly from zero indicates that the variable is important. In contrast, a coefficient close to or equal to zero means that the variable is unimportant [2]. Machine learning models take a different approach to obtaining information on the importance of predictor variables. This approach is variable importance analysis.
There are several ways to perform variable importance analysis, including permutation feature importance [3], Shapley Additive Explanations (SHAP) feature importance [4], and others. The analysis produces a score for each predictor variable and ranks the predictors by importance. The order of the scores or rankings depends on the machine learning model applied. Different machine learning models produce different variable importance measurements even when the variable importance analysis method is the same. Several papers demonstrate this.
Darmawan identified significant variables characterizing the incidence of family food insecurity in Indonesia. The study applied four machine learning modeling techniques followed by importance analysis using SHAP. These machine learning models are extreme gradient boosting (XGB), random forest (RF), neural network (NN), and support vector machine (SVM). The dataset consisted of 24 predictors from a sample of 24,769 families [5].
Figure 1 summarizes the order of importance of the predictor variables from the machine learning models implemented. The importance ranks differ from one model to another. For example, house size is the most important predictor under the SVM algorithm, whereas it is ranked ninth by XGB and only fifteenth by NN.
The different variable importance measurements produced by each machine learning model make it difficult to separate important predictors from less important ones. Therefore, variable importance measures need to be combined to facilitate interpretation [6]. Users of machine learning methods can interpret a single importance ordering easily. The existing merging methods are the average score of the variable importance measures, variable importance with a specific weight on its rank, and the mode of the ranks across variable importance measures. The weakness of these existing methods is that they have no objective function. The proposed variable importance has an objective value. The objective function measures how far the solution is from the goal. Being close to or achieving the goal means that the combination of the variable importance measures is optimal.
The proposed method uses a simulated annealing algorithm to find a single ranking that maximizes the minimum Spearman correlation between the solution and the original variable importance measures. This paper demonstrates that this approach can generate good results.
The proposed method produces a variable importance measure that can assess the contribution of predictors in predictive models.
The resulting scores help to identify which predictors are influential for prediction. The joint variable importance can provide a more comprehensive and robust understanding of feature relevance, which can aid in building more reliable and interpretable machine learning models.

Variable importance analysis
It is widely known that a predictive model produced by a machine learning algorithm is more difficult to interpret than a classical statistical model [7]. The difficulty comes from the fact that the model does not have an explicit mathematical function mapping a set of predictor variables to a single value of the response variable. The model also tends to capture non-linear relationships among variables.
Because of this, a follow-up analysis is needed to extract information from the model [8]. One useful analysis for this purpose is variable importance analysis (VIA) [7]. VIA tries to reveal each predictor variable's contribution to the model's performance in predicting the output of new observations [9]. A predictor variable with a higher contribution has a higher ranking among the predictors [8].
Several VIA methodologies can be found in the literature. In general, VIA procedures produce a set of scores from which the analyst can identify the relative importance of each predictor. Some existing methods are permutation variable importance, Shapley values, and SHAP. Permutation variable importance [10] is widely implemented in many situations. It can be applied to models generated by any machine learning algorithm, such as NN, RF, XGB, and SVM. Another advantage of this methodology is that it requires reasonably little computational time.
Suppose we have a predictive model F, which uses p predictor variables X1, X2, …, Xp to predict a response variable Y. Further, suppose that the matrix of predictor values is X and the vector of responses is y. The performance of the model F can be measured by a loss function $\mathcal{L}$, which is obtained by comparing the actual response vector y with the predicted response vector $\hat{y} = F(X)$. The variable importance score of a predictor variable Xj is obtained by the following steps [9]:
(1) Calculate $L^0 = \mathcal{L}(\hat{y}, X, y)$; this is the value of the loss function on the original data.
(2) Create a matrix $X^{*j}$ by permuting the j-th column of X.
(3) Calculate the predictions on the permuted data, $\hat{y}^{*j} = F(X^{*j})$.
(4) Calculate the value of the loss function on the permuted data, $L^{*j} = \mathcal{L}(\hat{y}^{*j}, X^{*j}, y)$.
(5) Calculate the importance score of Xj as $s_j = L^{*j} - L^0$ or $s_j = L^{*j} / L^0$.
In a classification problem, the loss function might be 1 - AUC or the misclassification rate. A high value of sj indicates that Xj is more important, because removing the information carried by this variable increases the loss function, that is, lowers the model performance.
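The permutation procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the names `permutation_importance`, `misclassification_rate`, and `toy_predict`, the toy data, and the choice of the difference form of the score are all hypothetical and introduced only for the example.

```python
import random

def misclassification_rate(y_true, y_pred):
    """Share of observations whose predicted class is wrong."""
    return sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, loss, X, y, seed=0):
    """Return one importance score per column of X (difference form)."""
    rng = random.Random(seed)
    # Loss on the original data (step 1).
    base_loss = loss(y, [predict(row) for row in X])
    scores = []
    for j in range(len(X[0])):
        # Permute only the j-th column (step 2).
        column = [row[j] for row in X]
        rng.shuffle(column)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        # Predict on the permuted data and recompute the loss.
        perm_loss = loss(y, [predict(row) for row in X_perm])
        # Importance score as the increase in loss (step 5, difference form).
        scores.append(perm_loss - base_loss)
    return scores

# Hypothetical toy model that only looks at the first predictor.
def toy_predict(row):
    return 1 if row[0] > 0 else 0

data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [toy_predict(row) for row in X]
scores = permutation_importance(toy_predict, misclassification_rate, X, y)
```

Because the toy model ignores the second predictor, permuting that column leaves the predictions unchanged and its score is zero, while permuting the first column raises the loss.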
Unfortunately, the sj scores may vary when we change the machine learning algorithm used to produce model F. Even if the dataset is identical, the importance rankings obtained from random forest might differ from those obtained from other machine learning algorithms. This can make it hard to decide which subset of variables should be considered significant. This paper discusses a proposed approach to ease the analyst's decision-making.

Notations
Suppose we run k different machine learning algorithms to generate k predictive models. For each model, we apply the permutation variable importance procedure to obtain k sets of variable importance rankings Ri, i = 1, …, k. The set Ri consists of p values, Ri = {si1, si2, …, sip}, where sij is the variable importance rank of predictor variable Xj, j = 1, …, p. Note that sij is a whole number between 1 and p, that is, sij ∈ {1, 2, …, p}, and sij ≠ sil for all j ≠ l.
Recall that sij is not the variable importance score itself but the rank of that score in descending order. A variable with a smaller sij is considered the more important predictor, since it is associated with a higher variable importance score.
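As a small illustration of this convention, the sketch below converts a set of importance scores into descending ranks. The function name `ranks_from_scores` and the example scores are hypothetical, and ties are broken by position, which is an assumption the text does not address.

```python
def ranks_from_scores(scores):
    """Rank 1 goes to the largest score; ties are broken by position."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks

# Hypothetical importance scores for three predictors X1, X2, X3.
example_ranks = ranks_from_scores([0.02, 0.15, 0.07])
```

Here the second predictor has the largest score and therefore receives rank 1, while the first predictor receives rank 3.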
Our proposed method aims to obtain a single set of variable importance ranks from those k sets. In other words, we would like to unify the k sets into one. Let W = {w1, …, wp} denote the result of the unifying process. As wj is the importance rank of predictor variable Xj, it has the same properties as sij. The value of wj should satisfy the conditions wj ∈ {1, 2, …, p} and wj ≠ wl for all j ≠ l.
The basic idea of the unification is to find wj, j = 1, …, p, so that the correlation between W and Ri is as high as possible for all i = 1, …, k. In other words, we would like to find a new variable importance ranking W that agrees highly with all the original rankings Ri. Let zi be the correlation between W and Ri, zi = cor(W, Ri) for i = 1, …, k. The proposed method seeks a situation in which zi is high for all i = 1, …, k. To ensure that, we define
Z = min {zi} (1)
and use it as the objective function of the maximization problem.
The complete formulation of the optimization problem can then be written as follows:
Find W = {w1, …, wp} that maximizes Z
subject to the constraints:
wj ∈ {1, 2, …, p} for all j = 1, …, p
wj ≠ wl for all j ≠ l
The above optimization problem is a combinatorial problem, since the solution is a permutation of the p whole numbers from 1 to p. Therefore, we propose a simulated annealing algorithm as a metaheuristic approach to this optimization problem. The following subsection describes the details of the algorithm.
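The objective function Z can be illustrated with a short Python sketch. It assumes that the correlation is the Spearman rank correlation and that the rankings contain no ties, so the closed form 1 - 6 Σd² / (p(p² - 1)) applies. The function names and the example rankings are hypothetical.

```python
def spearman(a, b):
    """Spearman correlation of two tie-free rankings of 1..p."""
    p = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (p * (p * p - 1))

def objective(w, rankings):
    """Z: the smallest correlation between w and any original ranking."""
    return min(spearman(w, r) for r in rankings)

# Hypothetical rankings from k = 3 models for p = 4 predictors.
R = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 2, 4, 3]]
Z = objective([1, 2, 3, 4], R)
```

For this candidate, the correlations with the three rankings are 1, 0.8, and 0.8, so Z is 0.8. A candidate that matched one ranking perfectly but disagreed strongly with another would score lower, which is the behavior the max-min formulation is meant to penalize.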

Computational procedure
As mentioned, a simulated annealing algorithm is used to solve the optimization problem that unifies the importance ranks. The basic algorithm is described as follows.
In the initialization step, we define the values of Tmax and Tmin. Together they determine the number of iterations available to improve the solution. Then an initial solution w is generated. In our method, it is a random permutation of the whole numbers 1 to p, which is improved during the process. Initially, we set T = Tmax [11].
Each iteration starts by generating a new solution w' that is slightly different from w. It is generated by exchanging the values of two randomly chosen entries of w. The process continues by replacing w with w' whenever w' performs better than w [12,13]. In this paper, the performance of a solution is Z, the minimum correlation between w and the rankings Ri. The higher the Z value, the better the solution.
The reason for maximizing the minimum correlation is as follows. A good feature ranking is one that has a high correlation with every single result from the machine learning algorithms. If the minimum correlation value is large, then all the other correlation values are at least as large. Therefore, maximizing the minimum correlation means that the algorithm yields high values for all correlation coefficients.
The second condition for replacing w with w' is random. Even if Z(w') is lower than Z(w), w' may still replace w, with a probability that decreases with the number of iterations and with the difference in performance. This means that when the iterations have gone on for a long time, the probability of replacing the current solution with a worse one is lower. Also, if the performance of the alternative solution is much worse, the probability is much lower.
As the iterations proceed, the value of T decays. The algorithm stops when T reaches Tmin [14], and the most recent w is the final solution containing the set of importance ranks.
The pseudocode of the algorithm can be written as follows [15]:
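A minimal Python sketch of the procedure described in this subsection is given here. It is an illustration under stated assumptions rather than the authors' implementation: the geometric cooling schedule (`cooling`), the Metropolis-style acceptance probability exp((Z(w') - Z(w)) / T), the default values of `t_max` and `t_min`, and all function names are assumptions, since the text does not specify them.

```python
import math
import random

def spearman(a, b):
    """Spearman correlation of two tie-free rankings of 1..p."""
    p = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (p * (p * p - 1))

def objective(w, rankings):
    """Z: the smallest correlation between w and any original ranking."""
    return min(spearman(w, r) for r in rankings)

def unify_ranks(rankings, t_max=1.0, t_min=1e-3, cooling=0.995, seed=0):
    """Search for a permutation w with a high minimum correlation Z."""
    rng = random.Random(seed)
    p = len(rankings[0])
    w = list(range(1, p + 1))
    rng.shuffle(w)                        # random initial permutation
    z = objective(w, rankings)
    t = t_max
    while t > t_min:
        i, j = rng.sample(range(p), 2)    # two random positions to swap
        w_new = w[:]
        w_new[i], w_new[j] = w_new[j], w_new[i]
        z_new = objective(w_new, rankings)
        # Always accept a solution that is not worse; accept a worse one
        # with a probability that shrinks as T decays and as the loss in
        # Z grows (assumed Metropolis rule).
        if z_new >= z or rng.random() < math.exp((z_new - z) / t):
            w, z = w_new, z_new
        t *= cooling                      # assumed geometric cooling
    return w, z                           # most recent solution

# Hypothetical rankings from three models for five predictors.
R = [[1, 2, 3, 4, 5], [2, 1, 3, 4, 5], [1, 2, 3, 5, 4]]
w, z = unify_ranks(R)
```

The swap move keeps every candidate a valid permutation of 1 to p, so the constraints of the optimization problem hold by construction. Following the description above, the function returns the most recent solution; keeping track of the best solution seen so far would be a possible variation.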

SIMULATION STUDY
This section discusses a study to evaluate the performance of the proposed method to obtain a single set of variable importance ranks. The evaluation was performed through a simulation with sufficient replications to draw a conclusion [16].

Design of simulation
A simulation study examined the quality of the variable importance ranks produced by the proposed method. The general strategy was to compare the variable importance ranks with the actual ranks. To do that, the simulation starts with a data-generating step.
The generated datasets contain p predictors and one response variable, designed so that the contribution of each predictor differs from one to another. The second step was the modeling process using machine learning algorithms, followed by permutation variable importance analysis based on the resulting models. This was followed by the proposed unification process using simulated annealing. In the last step, we evaluated the result by examining the level of agreement between the actual order and the order of importance obtained by the proposed approach.
The working steps of the simulation, repeated a hundred times, are as follows:
(1) Step 1 generates the datasets. The dataset contains 24,769 observations with 24 predictor variables and one binary response variable. The contribution of each predictor was designed so that X24 is the highest, X23 is the second highest, and so on until X1, which has the lowest contribution. This order is taken as the actual order of the variable importance ranks and is used as the benchmark to evaluate the results. To give the dataset those properties, the detailed data-generating procedure is as follows:
-Generate randomly 1,000 observations of 24 predictor variables having a standard normal distribution
-For each observation, calculate q and p, where
$q = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{24} X_{24}$ (2)
-Generate the binary response variable y ~ Bernoulli(p)
(2) Step 2 performs the modeling and the variable importance analysis. For each dataset, several supervised machine learning algorithms were implemented. The set of algorithms includes RF, XGB [17], NN [18], and SVM [19]. Once the predictive model was obtained, a permutation variable importance analysis was performed. At the end of this step, we had four sets of variable importance ranks, one for each machine learning model. Classification models of this kind can be fitted with the scikit-learn library in Python [20].
(3) Step 3 is the unification of the variable importance ranks. In this step, the proposed method was applied to generate unified ranks. A simulated annealing procedure was conducted to find the optimal solution for each dataset.
(4) Step 4 evaluates the results. To evaluate the quality of the results, Spearman's rank correlation was calculated to examine the agreement between the unified variable importance ranks and the actual order of importance. The Spearman correlation is preferred to the Pearson correlation because we focus on the level of agreement, not on a linear pattern. We compared that correlation with the correlation between the original variable importance ranks and the actual order. If the correlation of the unified ranks is higher, we can conclude that the methodology works well in obtaining better results.
Notice that in this simulation study, we varied the degree of the relationship among the predictor variables. There are five scenarios for the correlation r among predictor variables. The values of r are 0, 0.35, 0.70, 0.80, and 0.90. The scenario with r = 0 represents a dataset with independent predictors, while r = 0.35 represents a dataset with weakly correlated predictors. The other values represent high correlation.
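The data-generating step can be sketched as follows. Several details are assumptions made only for this illustration and are not given in the text: the logistic link used to turn q into p, the coefficient values in `betas`, the zero intercept, and the shared-factor construction used to give every pair of predictors the same correlation r.

```python
import math
import random

def generate_dataset(n, betas, r, beta0=0.0, seed=0):
    """Return (X, y): equicorrelated N(0, 1) predictors, binary response."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        # A shared factor gives each pair of predictors correlation r
        # while keeping unit variance (assumed construction).
        shared = rng.gauss(0, 1)
        row = [math.sqrt(r) * shared + math.sqrt(1 - r) * rng.gauss(0, 1)
               for _ in betas]
        # Linear predictor q as in equation (2).
        q = beta0 + sum(b * x for b, x in zip(betas, row))
        # ASSUMPTION: logistic link from q to the success probability p.
        p = 1 / (1 + math.exp(-q))
        X.append(row)
        y.append(1 if rng.random() < p else 0)   # y ~ Bernoulli(p)
    return X, y

# Hypothetical coefficients: X24 contributes most, X1 least.
betas = [j / 24 for j in range(1, 25)]
X, y = generate_dataset(1000, betas, r=0.35)
```

With increasing coefficients, the true order of importance runs from X24 down to X1, which matches the benchmark order described in Step 1.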

Simulation results
Table 1 summarizes the general results of the simulation. Each table cell contains the median correlation between the ranks obtained from the machine learning algorithms and the ground-truth ranks. The 0.10 and 0.90 quantiles of the correlation values are given in parentheses. Several points emerge from these results. First, comparing the four machine learning algorithms, the permutation variable importance scores generated from the neural network are the best when the correlation between predictors is moderate. When the correlation among predictors is high, it is difficult to say which algorithm is better.
Second, as the degree of the relationship among the predictor variables increases, all algorithms tend to have difficulty reaching good performance. The median value decreases as r increases. This means that when the predictor variables are highly correlated, the permutation variable importance methodology is unable to identify the order of importance correctly.
Third, combining the importance ranks using the simulated annealing technique can slightly improve the result. For the scenarios with a high degree of correlation among predictors (r = 0.70, 0.80, and 0.90), the proposed method achieves higher median values than every single machine learning algorithm.
A special note should be given to the simulation scenario with r = 0.90. The quantile (0.10) of the importance-ranking correlations is negative for all machine learning algorithms. However, the negative value was avoided when the proposed approach was implemented.
In addition to the general comparison, we also compare the performance of the proposed method one-on-one for each replication. Recall that we ran 100 replications. Table 2 presents the number of replications in which the proposed approach outperformed each single ML methodology. If the number is greater than 50, we conclude that the proposed approach is better at generating the predictor importance rankings. The proposed method performs better than RF, XGB, and SVM in all scenarios. It is also better than NN when the degree of relationship among predictors is high (r equal to or greater than 0.70). Again, this strengthens the previous discussion that unifying variable importance using simulated annealing can provide more meaningful results for identifying the importance of the predictors from machine learning models.
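The one-on-one comparison amounts to counting, across replications, how often the unified ranking agrees with the true order more closely than a single model does. The sketch below shows that count. The function name and the correlation values are hypothetical, and counting ties as non-wins is an assumption.

```python
def count_wins(unified_corr, single_corr):
    """Replications in which the unified ranking beats a single model."""
    return sum(u > s for u, s in zip(unified_corr, single_corr))

# Hypothetical correlations with the true order over four replications.
unified = [0.91, 0.88, 0.75, 0.93]
single = [0.89, 0.90, 0.70, 0.93]
wins = count_wins(unified, single)
```

In this toy example the unified ranking wins in two of the four replications. With 100 replications, as in the study, the count can be read directly as a percentage.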
Based on the simulation, we conclude that (1) the proposed method is useful for unifying several variable importance measures (VIMs) into a single variable importance ranking, and (2) the unified ranking performs better than any single machine learning result, especially when the predictors are highly correlated.

Empirical data
The data come from the 2020 national economic survey (Susenas) in West Java, Indonesia. The number of predictor variables is 24, and one response variable has two classes: 0 indicates a food-secure family, and 1 indicates a food-insecure family. The number of observations is 24,679 families. The names of the predictors can be seen in Table 3.

Variable importance analysis for empirical data
Permutation variable importance (PVI) was applied to the FIES data using four models, and the resulting variable importance measures are shown in Figure 2. PVI is preferred because of its simplicity and ease of interpretation. PVI was carried out with hyperparameter tuning to obtain optimal results [21]. RF, XGB, NN, and SVM are applied because they are suitable for classification data, they can handle non-linear relationships between the predictor variables and the response variable, and they are widely used because of their effectiveness. In the RF variable importance, X1 is the most important, X2 is the sixth most important, and so on. In the XGB variable importance (VI), X1 is in first place, X2 is in second place, and so on, which differs from the RF results. For NN, the ranking of the variables is the same as that of RF except for variables X6, X8, X11, X13, X14, X16, X18, and X19. Most of the VI order from SVM differs from the order from the other models. The variable importance measure (VIM) therefore differs between machine learning models, as in the SHAP-FI analysis of a previous study [5]. These differences make the VIM difficult to interpret. Therefore, a merging method is proposed in this study. The importance orders from the individual machine learning methods are combined to produce the joint VIM. The joint variable importance showed a strong Spearman correlation with the machine learning models RF, XGB, NN, and SVM, with values of 0.953, 0.923, 0.936, and 0.926, respectively. The five most important predictor variables influencing family food insecurity are X3 (education of the family head), X1 (house size), X4 (number of family members who have a savings account), X5 (floor type of the house), and X7 (ownership of land) (Table 4).

Evaluation of the proposed method with empirical data
The joint variable importance scores can be used to describe the order of influence of the variables. The joint variable importance score is calculated by averaging the scores of the four PVI analyses (Figure 5). Based on the average, X1, X2, X3, X4, X5, X6, X7, X8, and X9 have scores greater than 0.03. A score of 0.03 means that the loss function on the permuted data differs from the loss function on the original data by 0.03. In other words, these predictors significantly influence the response variable. The threshold works like the elbow method in cluster analysis [23]. If predictors. Next, the characteristics of VI-SA are identified as a function of the number of predictors.
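The averaging and thresholding described here can be illustrated with a short sketch. The score values are hypothetical and cover only three predictors. Only the threshold of 0.03 is taken from the text.

```python
def average_scores(score_sets):
    """Average each predictor's importance score across models."""
    n_models = len(score_sets)
    return [sum(col) / n_models for col in zip(*score_sets)]

# Hypothetical PVI scores for three predictors from four models.
pvi_scores = [
    [0.08, 0.02, 0.05],   # RF
    [0.06, 0.01, 0.04],   # XGB
    [0.07, 0.03, 0.02],   # NN
    [0.05, 0.02, 0.05],   # SVM
]
avg = average_scores(pvi_scores)
# Keep the predictors whose average score exceeds the 0.03 threshold.
selected = [f"X{j + 1}" for j, a in enumerate(avg) if a > 0.03]
```

In this toy example, the first and third predictors pass the threshold and the second does not.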
A boxplot is used to assess the stability of the SA algorithm in creating the VI (Figure 6). The study considers different numbers of variables, from few to many: 5, 10, 15, 20, and 24. The objective value (minimum correlation) increased and became more stable when more variables were included. The proposed variable importance measure is also examined through individual classification tree algorithms. A classification tree is a method to predict an output [24]. The accuracy of the joint VIM increases as the number of variables increases.
The accuracy is quite good when the number of predictors exceeds 10 (Figure 8). This indicates that the proposed method is suitable. In summary, the application to empirical data supports the applicability of the method, and the evaluation results are satisfactory. The machine learning and joint variable importance measures are optimal when the predictors are independent of each other. The constituent variable importance measures should have high accuracy for the proposed variable importance to have high accuracy. The best accuracy is obtained when the number of predictors exceeds ten. The proposed variable importance is also subject to uncertainty.

CONCLUSION
The proposed variable importance measure (VIM) is a solution when several variable importance measures from machine learning models are available and a decision about the influence of the predictors has to be made. The method is optimal if the predictors are independent of each other and the number of predictors is more than ten. The method is subject to uncertainty and has high accuracy. The joint VIM makes it easy to identify the rank of the predictors that influence the response variable. The proposed method should be used on datasets whose predictors are independent of each other. The constituent variable importance measures should come from models with optimal hyperparameters. The three most influential predictors of family food insecurity in the FIES data are the education of the head of the family (X3), the house size (X1), and the number of family members who have an account (X4).
The novelty of this research is a method for joining several variable importance measures with high accuracy. In addition, the method can be developed or modified further. The parts open to modification include the form of the objective function and the strategy for changing the solution in the simulated annealing algorithm.

Figure 4. Iteration to obtain the proposed optimal combined variable importance measure

Figure 5. Scores of average variable importance of random forest, XGBoost, Neural Network, Support Vector Machine

Figure 7. Patterns for ranks of each predictor with 100 replications

Table 1. The median and the 80% range of correlation values between actual and predicted variable importance ranks

Table 2. The percentage of replications that the proposed method outperforms variable importance rank based on four different machine learning algorithms

Table 3. Predictors, symbol, and measure in Food Insecure Experience Scale (FIES) data

Table 4. Machine-learning and simulated annealing (SA) variable importance