Leveraging Support Vector Machine for Predictive Analysis of Earned Value Performance Indicators in Iraq's Oil Projects

ABSTRACT


INTRODUCTION
Oil and gas construction projects play a pivotal role in facilitating production processes within the industry [1]. However, these projects frequently grapple with protracted risks, resulting in extended timelines, elevated costs, and compromised quality, thereby undermining their success [1]. The inherent complexity of technology and management within the oil and gas sector renders these projects among the most challenging to execute. For project managers, in addition to possessing relevant experience, adherence to a consistent reference framework predicated on the continual monitoring and evaluation of all formal project stages is essential [2]. Effective management within the oil and gas industry necessitates robust strategies for time, cost, and quality, thereby underscoring the need for techniques to mitigate the risk of future project failures [2].
Earned value management (EVM), a prevalent approach for project monitoring and control, facilitates project progress analysis by measuring scope, schedule, and cost [3]. Despite its benefits, the current methodologies and strategies for estimating earned value indexes in Iraq are deemed subpar and inefficient. The demand for novel and advanced technologies that enable the timely, accurate, and flexible estimation of earned value indexes has thus intensified [4]. Given the absence of an established modern methodology for estimating the earned value of Iraq's oil projects, this study primarily aims to formulate three mathematical models, leveraging the support vector machine (SVM), to predict the key indicators of earned value in the construction of the Karbala Refinery Project. These performance indicators are the cost performance index (CPI), the schedule performance index (SPI), and the to-complete cost performance indicator (TCPI).

LITERATURE REVIEW
Numerous researchers have explored employing SVM techniques for project management, focusing specifically on maintaining cost and timeline control. For instance, an SVM model was developed by Hasan et al. to estimate the cost of road projects, utilizing 43 sets of bills of quantity collated from Baghdad, Iraq [5]. The prediction equations formulated within this model demonstrated a robust performance in estimating construction costs for roads in Baghdad city, posting an average accuracy (AA) of 99.65% and a coefficient of determination (R²) of 97.63% [5].
Similarly, Juszczyk developed a model founded on machine learning and SVM techniques to predict site overhead costs, with the results affirming its effectiveness [6]. Alawadi et al. proposed an SVM-based model to furnish preliminary budget estimates for bridge construction, using basic data and metrics about bridges in the initial construction stages as input [7]. The forecasts derived from this model exhibited an acceptable estimation error range of 25-30%, indicating reasonable accuracy [7].
Additionally, a mathematical model was developed to predict the optimal time of completion for repetitive construction projects (RCPs) [8]. The constructed model, which leveraged SVM techniques, demonstrated a significant capability to predict the completion time of RCPs, with a correlation coefficient of 97%, a mean absolute error (MAE) of 3.6, and a root mean square error (RMSE) of 7% [8].
Eltoukhy and Nassar employed SVM to develop a model for predicting cost and time overruns in construction projects, by elucidating the causes and effects of cost and schedule overruns in building projects [9]. In 2021, Chandanshive and Kambekar developed a cost prediction model to enable accurate cost predictions early in a project's lifecycle [10]. The resultant SVM model for cost prediction in building construction projects exhibited a correlation coefficient (R) of 97.5257% and an R² of 94.299% between the actual and expected cost, with the overall accuracy defined as 94.29%. The mean absolute percent error (MAPE) of 8.96% signified that the model's percentage error met the error requirements [10].
Notably, Susilowati and Kurniaji integrated EVM methodology into a development project encompassing malls and hotels, measuring performance through indicators such as the cost performance index (CPI) and the schedule performance index (SPI) [11]. Hussien and Jasim proposed a tool that melds the building information modelling (BIM) technique with EVM, offering several features that assist project managers in circumventing errors during project progress stages by identifying conflicting elements that induce time delays and cost deviations [12].

METHODOLOGY
To identify the influencing factors and develop mathematical equations that determine the indexes quickly and readily, the following steps are used:
• Identifying the AI technique variables that have an impact on the EV indices in Iraqi oil projects.
• Creating mathematical models that may be applied to estimate the cost performance index (CPI), schedule performance index (SPI), and to-complete cost performance indicator (TCPI) in Iraqi oil projects before the execution phases.
• Developing equations for calculating the SPI, CPI, and TCPI for the oil projects.
• Verifying and validating the developed mathematical models to test the efficiency and accuracy of the results.
Figure 1 shows the methodology for the development of the SVM models.

EXPERIMENTAL WORKS (CASE STUDY)
The Karbala Refinery project was selected as the case study for this research because it is a very large project. The project is located 25 km south of Karbala City, Iraq (100 km south of Baghdad City). The total site area is 10 km², of which the refinery area is 6 km². The total cost of the project is about USD 6,641,089,012, with a working duration of 54 months. More information is summarized in Table 1. Figure 2 displays the units diagram of the Karbala Refinery project, and Figure 3 shows a 3D picture and a site photo of the Crude and Vacuum Distillation Unit.

PREPARATION OF DATA
All reports of the Karbala Refinery project, 83 in total, were obtained from the Karbala Refinery Project Authority of the State Company for Oil Projects (SCOP), Iraqi Ministry of Oil. Of these, 73 reports were used to build the SVM models and 10 were reserved for generalization. For each of the three models (CPI, SPI, TCPI), the data were separated into three sets (training, testing, and validation). The CPI model used 78% of the data for training, 11% for testing, and 11% for validation, so that 57 reports were used for training, 8 for validation, and 8 for testing. The SPI model used 70% of the data for training, 5% for testing, and 25% for validation, so that 51 reports were used for training, 18 for validation, and 4 for testing. For the TCPI model, the optimal division was found to be 84% for training, 5% for testing, and 11% for validation, so that 61 reports were used for training, 8 for validation, and 4 for testing. Each division was selected on the basis of the lowest testing error and the highest correlation coefficient (r).
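The report counts quoted above can be reproduced from the stated percentages. The sketch below is illustrative only: it assumes that each count was obtained by rounding the percentage of the 73 modelling reports to the nearest whole report, which the text does not state explicitly, and the names `SPLITS` and `split_counts` are hypothetical.

```python
# Reports available for modelling: 83 in total, 10 reserved for generalization.
TOTAL_REPORTS = 83
GENERALIZATION_REPORTS = 10
MODEL_REPORTS = TOTAL_REPORTS - GENERALIZATION_REPORTS  # 73

# Fractions in the order (training, testing, validation), as quoted in the text.
SPLITS = {
    "CPI": (0.78, 0.11, 0.11),
    "SPI": (0.70, 0.05, 0.25),
    "TCPI": (0.84, 0.05, 0.11),
}


def split_counts(n_reports, fractions):
    """Return (training, testing, validation) counts.

    Assumption: each count is the nearest whole number of reports.
    """
    return tuple(round(n_reports * f) for f in fractions)


counts = {name: split_counts(MODEL_REPORTS, fr) for name, fr in SPLITS.items()}

for name, (train, test, valid) in counts.items():
    print(f"{name}: training={train}, testing={test}, validation={valid}")
```

Under this rounding assumption, each of the three divisions sums to the 73 reports used for model building.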

CHOOSING A SUITABLE SUPPORT VECTOR MACHINE SOFTWARE
Today, support vector machine applications are used to solve many engineering problems, such as earned value prediction. Several support vector machine programs were examined, including Win SVM, MATLAB SVM Toolbox, LIBSVM, SVM light, STATISTICA, DTREG, and WEKA. The present study used the WEKA software because it was found to be the most suitable software for support vector machines: it is easy to apply, is highly compatible with both simple and more complex problems, and accepts different kinds of variables and factors. Weka is a collection of state-of-the-art machine learning algorithms and data pre-processing tools developed at the University of Waikato, New Zealand. WEKA is short for Waikato Environment for Knowledge Analysis, and its design enables flexible and easy testing of existing methods on data sets. Version 3 of Weka was used in this study because it is the best stable version. Five steps are involved in implementing SVM using Weka, namely data division, SMOreg, function selection, kernel selection, and determination of the SVM learning parameters. In this work, the kernel and the SVM parameters (C and epsilon) were selected on the basis of the lowest root mean square error (RMSE).

IDENTIFICATION OF THE VARIABLES FOR SVM MODELS
The SVM models require a large amount of data, which was collected from the Karbala Refinery Project for the period from 2015 to 2022. The data were collected directly: the project data were obtained from the Karbala Refinery Project Authority after many approvals, interviews, and repeated visits to the project. Although this method is somewhat complicated, a sufficient amount of reliable data was collected from documents and reports on the planning and implementation of the refinery. The historical data contained both dependent and independent variables, which were identified from the 83 reports for the Karbala Refinery project. Three variables were identified as dependent, namely the cost performance index (CPI), the schedule performance index (SPI), and the to-complete cost performance indicator (TCPI). Six variables were chosen as independent, as follows:
• BAC: the budget at completion;
• ACWP: the actual cost of the work performed (AC);
• A%: the actual progress percentage;
• BCWP: the budgeted cost of the work performed (EV);
• P%: the planned progress percentage;
• BCWS: the budgeted cost of the work scheduled (PV).
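The text does not write out how the three dependent indices relate to the cost inputs. As a point of reference, the sketch below uses the standard earned value definitions, CPI = BCWP/ACWP, SPI = BCWP/BCWS, and the BAC-based TCPI = (BAC - BCWP)/(BAC - ACWP). Whether the project reports used exactly these forms is an assumption, the function name `evm_indices` is hypothetical, and the numbers in the example are invented purely for illustration.

```python
def evm_indices(bac, acwp, bcwp, bcws):
    """Return (CPI, SPI, TCPI) from the standard earned value definitions.

    bac  : budget at completion
    acwp : actual cost of work performed (AC)
    bcwp : budgeted cost of work performed (EV)
    bcws : budgeted cost of work scheduled (PV)
    """
    if acwp == 0 or bcws == 0 or bac == acwp:
        raise ValueError("index undefined: a denominator is zero")
    cpi = bcwp / acwp                    # cost efficiency of the work done so far
    spi = bcwp / bcws                    # schedule efficiency of the work done so far
    tcpi = (bac - bcwp) / (bac - acwp)   # efficiency needed to finish within BAC
    return cpi, spi, tcpi


# Invented figures (same currency units), not taken from the project reports.
cpi, spi, tcpi = evm_indices(bac=1000.0, acwp=400.0, bcwp=420.0, bcws=440.0)
print(f"CPI={cpi:.3f}  SPI={spi:.3f}  TCPI={tcpi:.3f}")
```

In this convention, a CPI or SPI above 1 indicates performance better than planned, while a TCPI above 1 indicates that the remaining work must be done more efficiently than budgeted.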
The variables used in the SVM models that affect the EV indices are shown in Table 2. To ensure that all variables receive the same attention throughout training, the input and output variables are pre-processed by scaling them to eliminate their dimensions. As part of this approach, the scaled value of each variable x, with minimum x_min and maximum x_max, is computed using Eq. (1):

$$x_{n} = \frac{x - x_{min}}{x_{max} - x_{min}} \quad (1)$$
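A minimal sketch of the scaling step described above, assuming the usual min-max form in which each variable is mapped to the interval from 0 to 1 using its own minimum and maximum. The function name `min_max_scale` and the sample column are hypothetical and do not come from the project data.

```python
def min_max_scale(values):
    """Scale a column of values to [0, 1] using its own minimum and maximum.

    Returns the scaled values together with the minimum and the range,
    which are needed later to convert predictions back to actual units.
    """
    x_min = min(values)
    x_range = max(values) - x_min
    if x_range == 0:
        raise ValueError("all values are identical; scaling is undefined")
    scaled = [(v - x_min) / x_range for v in values]
    return scaled, x_min, x_range


# Invented sample column, for illustration only.
sample = [2.0, 4.0, 6.0]
scaled, x_min, x_range = min_max_scale(sample)
print(scaled, x_min, x_range)
```

Keeping the minimum and range alongside the scaled column is a deliberate choice here, since the same two numbers are required to reverse the transformation.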

DEVELOPMENT OF SVM MODELS
SVM models must be developed in an organized manner to improve their performance. The main factors to be addressed are the model inputs, data division and preprocessing, the model architecture and its optimization (training), the stopping criteria, and model validation. A structured methodology involving four major phases, including model architecture and SVM model validity, is applied to develop the model.
The same variables defined at the data identification step are used to develop the three mathematical SVM models, using the project characteristics to predict the EV indexes. Using the third version of Weka, the following models were developed:
• SPI: the schedule performance index
• CPI: the cost performance index
• TCPI: the to-complete cost performance indicator

Cost performance index (CPI) model
This model's development involves the following five steps:

Data division CPI model
The data are divided into three sets, namely training, testing, and validation sets, as shown in Table 3.
The kernel function is selected next (Table 4). The poly-kernel was chosen as the optimal kernel for the CPI model because its root mean square error (RMSE) of 0.1 is the lowest value found.
Table 5 illustrates the effect of C on the CPI model. When C is 10, the correlation coefficient (r) is highest (96.98%), the mean absolute error (MAE) is 0.0354, and the RMSE is lowest (0.0786), making this the ideal value. According to the statistics in this table, changes in C, especially between 1 and 10, have little impact on the performance of the CPI model. This supports including it in the study's suggested model.
For Epsilon (Table 6), the best value is 0.006, at which the r value is highest (87.06%), the RMSE is lowest (0.0744), and the MAE is 0.0528. The information in this table demonstrates that variations in Epsilon, especially between 0.001 and 0.05, have little impact on the performance of the CPI model. This favors including it in the study's model.
Table 7 shows the connection weights obtained from the Weka software for the optimal CPI model. Manual scaling is not necessary because the program decides whether the data should be transformed and which transformation method to use.
The normalized CPI is converted to its actual value as follows:

$$\mathrm{CPI}_{act} = \mathrm{CPI}_{nor} \times \mathrm{range} + \mathrm{min}$$

$$\mathrm{CPI}_{act} = \mathrm{CPI}_{nor} \times 0.5650 + 1.0340 \quad (4)$$

The implementation of the above equations can be illustrated using the data employed in training the SVM model for CPI, as shown for report no. 66 in Table 8. The predicted value obtained from the equation is 1.071, which is relatively accurate when compared with the actual value calculated by hand (CPI = 1.064), a deviation of about 0.7%. This difference is considered minor. Before using Eq. (2), note that all input variables must be transformed to values between 0 and 1 because Eq. (2) was built using Eq. (1). To recover actual values from the normalized ones, conversions were made using Eq. (4) and Table 3.
Table 8 summarizes and compares the CPI computation using SVM to validate the estimation model. It comprises the actual CPI value received from the Karbala Refinery Project and the estimated CPI value calculated using the SVM equation (as obtained from Weka V.3).
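The conversion from a normalized CPI back to an actual CPI can be sketched as follows, using the range (0.5650) and minimum (1.0340) quoted in the conversion equation above. The function name `denormalize` is hypothetical, and the normalized inputs are illustrative values rather than model outputs from the study.

```python
# Constants quoted in the CPI conversion equation.
CPI_RANGE = 0.5650
CPI_MIN = 1.0340


def denormalize(normalized, value_range, minimum):
    """Convert a normalized value in [0, 1] back to its actual scale."""
    return normalized * value_range + minimum


# The two ends of the normalized scale map to the smallest and largest
# CPI implied by the quoted constants.
cpi_low = denormalize(0.0, CPI_RANGE, CPI_MIN)
cpi_high = denormalize(1.0, CPI_RANGE, CPI_MIN)

# Normalized value that corresponds to the predicted CPI of 1.071 quoted
# for report no. 66, recovered by inverting the same relation.
cpi_nor_report_66 = (1.071 - CPI_MIN) / CPI_RANGE
cpi_act_report_66 = denormalize(cpi_nor_report_66, CPI_RANGE, CPI_MIN)

print(cpi_low, cpi_high, cpi_nor_report_66, cpi_act_report_66)
```

The same helper applies to the other two indices once their own range and minimum are substituted.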

Data division SPI model
The data are divided into three sets, namely training, testing, and validation sets, as shown in Table 9.
The kernel function is selected next (Table 10). The poly-kernel was chosen as the optimal kernel for the SPI model because its root mean square error (RMSE) of 0.0788 is the lowest value found.
Table 11 illustrates the effect of C on the SPI model. When C is 6, the correlation coefficient (r) is highest (96.98%), the MAE is 0.0354, and the RMSE is lowest (0.0786), making this the ideal value. According to the statistics in this table, changes in C, especially between 1 and 10, have little impact on the performance of the SPI model. This supports including it in the study's suggested model.
Table 12 displays Epsilon's impact on the SPI model. The best value of Epsilon is 0.006, at which the r value is highest (97.35%), the RMSE is lowest (0.0717), and the MAE is 0.0478. The information in this table demonstrates that variations in Epsilon, especially between 0.001 and 0.05, have little impact on the performance of the SPI model. This favors including it in the study's model.
The implementation of the SPI equation can be illustrated using the data employed in training the SVM model for SPI, as shown for report no. 56 in Table 14. The predicted value obtained from the equation is 0.954, which is relatively accurate when compared with the actual value calculated by hand (SPI = 1.034), a deviation of about 7.7%. This difference is considered minor. Before using Eq. (5), note that all input variables must be transformed to values between 0 and 1 because Eq. (5) was built using Eq. (1). To recover actual values from the normalized ones, conversions were made using Eq. (7) and Table 9.
Table 14 summarizes and compares the SPI computation using SVM to validate the estimation model. It comprises the actual SPI value received from the Karbala Refinery Project and the estimated SPI value calculated using the SVM equation (as obtained from Weka V.3).
Figure 6 plots the predicted values against the actual values for the verification data to show the capacity of the SVM model for SPI. The figure shows R² = 96.5%. Figure 7 shows the generalization results for the SPI model, which can be described as excellent.

Data division TCPI model
The data are divided into three sets, namely training, testing, and validation sets, as in the previous models, as illustrated in Table 15.
The kernel function is selected next (Table 16). The poly-kernel was chosen as the optimal kernel for the TCPI model because its RMSE of 0.081 is the lowest value found.
Table 17 illustrates the effect of C on the TCPI model. When C is 1, the correlation coefficient (r) is highest (94.12%), the mean absolute error (MAE) is 0.0422, and the RMSE is lowest (0.081), making this the ideal value. According to the statistics in this table, changes in C, especially between 1 and 10, have little impact on the performance of the TCPI model. This supports including it in the study's suggested model.
Table 18 displays Epsilon's impact on the TCPI model. The best value of Epsilon is 0.04, at which the r value is highest (94.26%), the RMSE is lowest (0.0778), and the MAE is 0.0489. The information in this table demonstrates that variations in Epsilon, especially between 0.001 and 0.05, have little impact on the performance of the TCPI model. This favors including it in the study's model.
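For readers less familiar with the two tuned parameters, the sketch below shows their role in the standard epsilon-insensitive formulation of support vector regression: errors smaller than epsilon are ignored, and C weights the errors that exceed it against the size of the weights. This is a generic textbook form and not a reproduction of Weka's SMOreg internals; the function names and the sample residuals are hypothetical, and only the epsilon value of 0.04 is taken from the text.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Loss is zero inside the epsilon tube and grows linearly outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)


def svr_objective(weights, pairs, c, epsilon):
    """Regularization term plus C times the summed epsilon-insensitive losses.

    weights : model weights (a smaller norm gives a flatter function)
    pairs   : iterable of (actual, predicted) values
    """
    regularization = 0.5 * sum(w * w for w in weights)
    penalty = c * sum(epsilon_insensitive_loss(a, p, epsilon) for a, p in pairs)
    return regularization + penalty


EPSILON = 0.04  # best Epsilon reported for the TCPI model

# Invented residuals: the first lies inside the tube, the second outside it.
inside = epsilon_insensitive_loss(1.00, 1.03, EPSILON)
outside = epsilon_insensitive_loss(1.00, 1.10, EPSILON)
objective = svr_objective([0.5, -0.5], [(1.00, 1.03), (1.00, 1.10)], c=1.0, epsilon=EPSILON)
print(inside, outside, objective)
```

Seen this way, a larger epsilon tolerates more deviation before any penalty applies, and a larger C penalizes the remaining deviations more heavily.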

Equation of TCPI model
The TCPI value can be estimated using the connection weights and threshold level shown in Table 19. The normalized TCPI is converted to its actual value as follows:

$$\mathrm{TCPI}_{act} = \mathrm{TCPI}_{nor} \times \mathrm{range} + \mathrm{min}$$

$$\mathrm{TCPI}_{act} = \mathrm{TCPI}_{nor} \times 0.8020 + 0.1960 \quad (10)$$

The implementation of the above equations can be illustrated using the data employed in training the SVM model for TCPI, as shown for report no. 66 in Table 20. The predicted value obtained from the equation is 0.440, which the study regards as relatively accurate when compared with the actual value calculated by hand (TCPI = 0.591), a deviation of about 25.5%. This difference is treated as minor in this study. Before using Eq. (8), note that all input variables must be transformed to values between 0 and 1 because Eq. (8) was built using Eq. (1). To recover actual values from the normalized ones, conversions were made using Eq. (10) and Table 15.
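To put the three worked examples on a common scale, the relative deviation between each predicted and hand-calculated value quoted in the text can be computed as below. The predicted and actual figures are those reported for the CPI, SPI, and TCPI examples; the helper name `percent_deviation` is hypothetical, and the percentage form is an illustrative choice rather than a measure defined by the study.

```python
# (predicted, actual) pairs quoted in the text for the three worked examples.
REPORTED = {
    "CPI": (1.071, 1.064),
    "SPI": (0.954, 1.034),
    "TCPI": (0.440, 0.591),
}


def percent_deviation(predicted, actual):
    """Absolute deviation of the prediction as a percentage of the actual value."""
    return abs(predicted - actual) / abs(actual) * 100.0


deviations = {name: percent_deviation(p, a) for name, (p, a) in REPORTED.items()}

for name, dev in deviations.items():
    print(f"{name}: {dev:.2f}%")
```

A single report per index is only a spot check, so these figures say little about overall model accuracy, which is assessed separately in the validation section.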

Verification of the TCPI model
Table 20 summarizes and compares the TCPI computation using SVM to validate the estimation model. It comprises the actual TCPI value received from the Karbala Refinery Project and the estimated TCPI value calculated using the SVM equation (as obtained from Weka V.3). Figure 8 plots the predicted values against the actual values for the verification data to show the capacity of the SVM model for TCPI. The figure shows R² = 87.1%. The generalization results of the TCPI model, with R² = 87.53%, are excellent, as shown in Figure 9.

VALIDATION OF THE SVM MODELS
Mean percentage error (MPE), root mean squared error (RMSE), mean absolute percentage error (MAPE), average accuracy percentage (AA%), coefficient of determination (R²), and coefficient of correlation (R) are the statistical measures most frequently used to assess a model's accuracy [12,13]. Table 21 shows the outputs of the comparative study. The average accuracy (AA%) for the CPI, SPI, and TCPI models was 96.093%, 91.709%, and 66.024%, respectively, while the correlation coefficients (R) were 92.8%, 98.2%, and 93.3%. As a result, the models showed good consistency with the actual data, with the TCPI model being the least accurate of the three.
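The measures named above can be sketched as follows. The definitions used here are common ones and are assumptions with respect to this study: in particular, AA% is taken as 100 minus MAPE, and R² is taken as the square of the correlation coefficient, which approximately matches the R and R² pairs reported for the three models (for example, 0.928 squared is about 0.861). The function name `validation_metrics` and the sample values are hypothetical.

```python
import math


def validation_metrics(actual, predicted):
    """Return MPE, MAPE, RMSE, AA%, R and R2 for paired actual/predicted values.

    Assumes non-empty inputs of equal length, non-zero actual values, and
    some variation in both series (otherwise R is undefined).
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mpe = 100.0 * sum(e / a for e, a in zip(errors, actual)) / n
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Pearson correlation coefficient between actual and predicted values.
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_a * var_p)

    return {
        "MPE": mpe,
        "MAPE": mape,
        "RMSE": rmse,
        "AA": 100.0 - mape,  # assumption: average accuracy = 100 - MAPE
        "R": r,
        "R2": r * r,         # assumption: R2 is the squared correlation
    }


# Invented values for illustration, not taken from Table 21.
example = validation_metrics([1.0, 2.0, 4.0], [1.1, 1.9, 4.2])
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

MPE keeps the sign of the errors and so indicates bias, whereas MAPE and RMSE measure the size of the errors regardless of direction.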

CONCLUSIONS
Oil and gas construction projects, especially refineries, are very important in Iraq today because they support the operation and production process. However, these projects have suffered significant cost overruns, time overruns, and poor quality, which has hurt their success and is a major concern for the industry. This study was therefore conceived to forecast the earned value indices of refinery projects, namely the schedule performance index (SPI), the cost performance index (CPI), and the to-complete cost performance indicator (TCPI), using the support vector machine. Three models were developed, with six variables as inputs: BAC, ACWP, BCWP, BCWS, A%, and P%. Three equations were derived. The research yielded several significant findings: the average accuracy (AA%) for the CPI, SPI, and TCPI models was 96.093%, 91.709%, and 66.024%, respectively; the correlation coefficients (R) were 92.8%, 98.2%, and 93.3%; and the root mean squared error (RMSE) was 0.0969, 0.0604, and 0.2260, respectively. The findings of this study can serve as a marker and a further prediction guide for forecasting the success of oil projects. This research is limited to forecasting performance measurement for oil projects, especially refineries. Future studies should concentrate on estimating the performance of different project types using additional inputs or other artificial intelligence methodologies, such as genetic algorithms, dynamic programming, and others.

Figure 2. Units diagram of Karbala Refinery Project

Figure 3. Crude and vacuum distillation unit

Figure 4. Comparison of predicted and actual for CPI model
Figure 4 illustrates the predicted values against the actual values for the verification data to show the capacity of the SVM model for CPI. It is clear from this figure that R² = 86.1%. The ten spare data that have not

Figure 5. Generalization of CPI model
Schedule performance index (SPI) model
This model's development involves the following five steps:

Figure 6. Comparison of predicted and actual for SPI model
Verification of the SPI model

Figure 7. Generalization of SPI model
To-complete cost performance indicator (TCPI) model
This model's development involves the following five steps:

Figure 8. Comparison of predicted and actual for TCPI model

Figure 9. Generalization of TCPI model

Table 1. Project background information

Table 2. Variables of SVM models

Table 3. Impact of data split on performance of CPI model

Table 4. Effects of the kernel function on CPI model

Table 5. Impact of changing parameter C on CPI model performance

Table 6 displays Epsilon's impact on the CPI model.

Table 6. Impact of changing parameter Epsilon on CPI model performance

Table 7. Weight estimates for model CPI

Table 8. Verification of CPI model

Table 9. Impact of data split on performance of SPI model
Selection kernel SPI model
The next step is to choose the kernel, as displayed in Table 10.

Table 10. Effects of the kernel function on SPI model

Table 11. Effect of the parameter C in SVM model performance

Table 12. Impact of changing parameter Epsilon on SPI model performance
Equation of SPI model
The SPI value might be estimated using the connection weights and threshold level shown in Table 13 as follows:

Table 13. Weight estimates for model SPI

Table 14. Verification of SPI model

Table 15. Impact of data split on performance of TCPI model

Table 16. Effects of the kernel function on TCPI model

Table 17. Impact of changing parameter C on TCPI model performance

Table 18. Impact of changing parameter Epsilon on TCPI model performance

Table 19. Weight estimates for model TCPI

Table 20. Verification of TCPI model

Table 21. The outputs of the validation study for CPI, SPI, and TCPI SVM models