Predicting Global Energy Consumption Through Data Mining Techniques

With the explosion of the global population and technological progress


INTRODUCTION
The widespread deployment of smart meters has greatly increased the quantity of available electricity consumption data, making it a part of Big Data. The emphasis on integrating demand flexibility into the grid has shifted towards buildings, as they are responsible for a significant portion of energy usage, comprising 31% of final energy consumption and over 55% of global electricity demand. Demand response is the adaptation of consumers' energy consumption patterns in response to changes in the energy system, either by a change in behavior or through automation [1]. The tremendous amount of energy used by buildings, together with the rapid increase in the amount of energy buildings need, is driving the creation of techniques to make buildings consume less energy [2]. The objective of this study is to provide a thorough overview of electricity consumption in different countries around the world using data mining techniques. In addition, it aims to uncover patterns, correlations, and rules in the power usage data that can shed light on energy consumption trends. The analysis is performed using data mining techniques such as simple K-Means and Expectation Maximization (EM), chosen because they have shown promise on similar problems in the literature. The analysis involves two time frames. First, the algorithms are trained on historic values obtained over a specified timeframe. Second, based on the learned patterns, future consumption trends are predicted from the extracted knowledge. This is the general process of prediction using data mining and machine learning [3].
The proposed study is significant for the energy sector, especially amid the emergence of information and communication technologies (ICT) that place a huge demand on it. For instance, cloud servers in cloud computing demand large amounts of energy on a 24/7 basis to perform their daily operations, which calls for sustainable solutions [4][5][6]. The results of the analysis are presented in a form that can serve as a valuable resource for decision-makers and stakeholders in the energy sector, as it provides a deeper understanding of electricity consumption patterns and trends, which can inform energy planning and policy. For instance, based on the outcomes of the study, the concerned authorities can plan sustainable demand response strategies at earlier stages to prevent undesirable circumstances such as power failures [7].
The rest of the paper is organized as follows: Section 2 contains a review of the related literature. Section 3 describes the investigated methodology in detail. Section 4 presents the experimental results obtained on the dataset, while Section 5 concludes the paper.

LITERATURE REVIEW
Corten et al. [8] discussed the use of data mining techniques to optimize the energy performance of buildings. The authors argue that data mining can be used to identify patterns and relationships in energy consumption data, which can then be used to improve the energy efficiency of buildings. To demonstrate the effectiveness of this approach, the authors present a case study in which data mining techniques were used to analyze energy consumption data from a residential building. The results of the analysis showed that data mining was able to identify patterns and relationships in the data that were not apparent from traditional statistical analysis. The authors also discuss the potential benefits of using data mining for energy performance optimization, including the ability to identify and address energy waste, reduce energy costs, and improve the overall energy efficiency of buildings. Overall, the paper presents a compelling case for the use of data mining techniques to optimize energy performance in buildings. The authors provide a thorough review of the literature on the topic and present a well-designed case study to demonstrate the effectiveness of their approach. However, it would be helpful to see additional case studies or larger-scale studies to further validate the findings of this work.
Aki et al. [9] used data mining techniques to analyze real-time electricity data. They divided their work into two main phases. Phase one comprises data collection, data preprocessing, data understanding, finding the value of k, and clustering. Phase two covers the FP-growth algorithm and its results. The dataset was gathered from six houses in real time. In the data preprocessing phase, they performed four steps: data cleaning, data integration, data transformation, and data reduction. In the data understanding phase, they used box plots to represent the dataset. They then used the elbow method to find the number of clusters, K. After that, they applied clustering, with each cluster assigned a color in the cluster plot. Finally, they converted their data to binary values so that the FP-growth algorithm could be applied to find frequent patterns.
Sharma [10] aimed to produce a data mining model for energy forecasting. The goal was to use data mining techniques to define the relationship between energy consumption and the weather. Data were collected from the Australian energy company Ausgrid and from Open Power System Data. A machine learning algorithm was used, with the dataset split into training and testing sets, and cross-validation was applied to avoid underfitting and overfitting. In the preprocessing phase, to handle outliers and continuously measure accuracy, anomaly detection methods were used, including support vector anomaly detection, clustering-based anomaly detection, and density-based anomaly detection. Finally, the outcomes of the anomaly detection methods were used to develop a unified model intended to provide more accurate short-term, medium-term, and long-term forecasts.
Ebrahim and Mohammed [11] introduced a methodology to enhance household demand forecasting based on energy disaggregation for Short Term Load Forecasting (STLF). They applied five energy disaggregation algorithms and used feed-forward artificial neural networks (FFANN) to process the disaggregated output. The five disaggregation methods were applied to data from two separate houses: denoising autoencoder (DAE), RECTANGLES, recurrent neural network (RNN), factorial hidden Markov model (FHMM), and combinatorial optimization (CO). These were applied and analyzed in various combinations, and seven performance criteria were used to provide a thorough evaluation of their performance. The compared configurations were RECTANGLES + FFANN, DAE + FFANN, RNN + FFANN, FHMM + FFANN, CO + FFANN, FFANN, SVM, and autoregressive integrated moving average (ARIMA). They found that the denoising autoencoder was the best method for energy disaggregation and had a direct impact on the efficiency of STLF at the individual household scale. According to their comparison and performance analysis, the suggested approach (DAE + FFANN) results in a 91.13% reduction in root mean square error (RMSE) and normalized RMSE (NRMSE) and a 92.36% reduction in mean absolute error (MAE) when compared to ARIMA.
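The error measures quoted above can be illustrated with a short sketch. This is not the code of [11]; the data are toy values, the function names are hypothetical, and NRMSE is normalized here by the range of the actual values, which is only one of several conventions in use (an assumption, since the source does not state which one was applied).

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def nrmse(actual, predicted):
    """RMSE normalized by the range of the actual values (assumed convention)."""
    return rmse(actual, predicted) / (max(actual) - min(actual))

def percent_reduction(baseline_error, new_error):
    """Relative error reduction of a new model against a baseline, in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Toy load values (hypothetical, for illustration only).
actual = [2.0, 4.0, 6.0, 8.0]
baseline_pred = [4.0, 6.0, 8.0, 10.0]  # baseline model, off by 2.0 everywhere
improved_pred = [2.5, 4.5, 6.5, 8.5]   # improved model, off by 0.5 everywhere

mae_reduction = percent_reduction(mae(actual, baseline_pred), mae(actual, improved_pred))
rmse_reduction = percent_reduction(rmse(actual, baseline_pred), rmse(actual, improved_pred))
```

A reduction figure such as the 91.13% reported in [11] is this kind of ratio: the improvement of the proposed model's error relative to the error of the baseline (there, ARIMA).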
Kim and Cho [12] proposed a CNN-LSTM neural network that can forecast the energy usage of homes by extracting spatial and temporal features. The model combines two techniques: a convolutional neural network (CNN) and long short-term memory (LSTM). The combined technique has demonstrated experimentally its ability to extract intricate features of energy use. The CNN layer extracts features relating the various factors that affect power consumption, while the LSTM layer is well suited to modeling the temporal information of irregular trends in time series components. For electric energy usage, which was previously difficult to forecast, the suggested CNN-LSTM technique yields nearly perfect prediction performance. The authors verified that their model outperforms other models regardless of time resolution.
Wei et al. [13] examine the prevalent data-driven strategies employed in building energy analysis at various levels of granularity and typology, including strategies for prediction. The methods covered include support vector machines (SVM), artificial neural networks (ANN), statistical regression, decision trees, and genetic algorithms. The review's findings show that data-driven approaches have successfully addressed a wide range of applications related to building energy, including load forecasting and prediction, energy pattern profiling, mapping regional energy consumption, benchmarking for building stocks, global retrofit strategies, and developing guidelines, among others. Importantly, the review also refines a few crucial tasks for adapting data-driven methodologies to building energy analysis. Through appropriate retrofits and the use of renewable energy technology, the results of this research may help future micro-scale adjustments in the energy usage of a specific building. Additionally, it opens a door for prospective macro-scale energy-reduction research that takes customer preferences into account. All of these will be helpful in developing a stronger long-term urban sustainability plan.
The findings of Fan et al. [14] are based on a cluster analysis of electricity demand information from 1,000 Swiss residences. To assess the suitability of various building archetypes (hospitals, schools, offices, hotels, apartments, households, etc.) for demand response programs, patterns of electricity use must be identified. The authors created four criteria to gauge each archetype's capacity for demand flexibility, ramp rates, and peak management. The findings demonstrated that patterns of electricity use varied significantly across archetypes and that occupant schedules had an impact on load profiles, peak intensities, and ramp rates. Compared to the non-domestic sector, the domestic sector (houses and apartments) had greater peak intensities, making it a target for peak reduction plans. The non-domestic sector, on the other hand, offered more possibilities for flexibility, except for schools, mansions, and commercial buildings. The authors concluded that, because of the flexibility of their demand, households should be encouraged to participate in dynamic energy tariffs. However, carefully planned programs with both financial and social components are required to foster acceptance.
Liu et al. [2] presented a data-mining-based method for assessing the dynamic energy performance of a variable refrigerant flow (VRF) system. They identified issues with the energy assessment of the VRF system by analyzing energy data gathered from experiments on a 29.8 kW VRF system. To address these issues, the research proposed a generic framework for assessing energy performance based on data mining methods, including correlation analysis and decision tree (DT) analysis. The findings demonstrated that the power consumption of the VRF system was highly impacted by both the PLR and OT, and that the power consumption patterns controlled how much the refrigerant charge fault affected system power consumption. To give a quantitative evaluation of the power consumption of the VRF system, the study produced nine energy benchmarks and created an energy consumption rating system. The proposed method for assessing the dynamic energy performance of the VRF system was found to be accurate and reasonable. Their future studies will focus on applying the proposed method to more operational conditions.
Aki et al. [9] present an approach for predicting next-day energy consumption and peak power demand. The approach consists of three main steps. First, unusual energy consumption values, such as outliers, are removed. Second, an embedded variable selection method, recursive feature elimination (RFE), is used to obtain the optimal inputs for eight prediction models that were separately developed using popular algorithms. The parameters of these models are obtained through leave-group-out cross validation (LGOCV). Finally, a genetic algorithm (GA) is used to optimize the weights of the eight prediction models. The mean absolute percentage error (MAPE) is used to measure the prediction accuracy of the ensemble model. The MAPE was 2.32% for next-day energy consumption and 2.85% for peak power demand.
Wang et al. [15] used data mining to analyze energy usage and optimize dormitory buildings. To increase accuracy and improve the quality of the data, outliers and missing values were removed during the preprocessing stage. They also used a survey to compare how much energy each student was using. The energy consumption analysis was conducted using the K-means clustering technique. They first standardized their data and then used the Mclust algorithm to discover the best number of clusters, which revealed that the data could be grouped into five categories. Furthermore, a decision tree categorized the gathered dormitory attributes into four groups with a 91.3% classification accuracy. The results show that energy consumption can be reduced by 15.8% when the subject category is studied.
The 2020 study by Sathishkumar et al. [16] focuses on using data mining techniques to predict energy consumption in the steel industry. The study uses data from the DAEWOO steel company, which includes current reactive power, carbon dioxide emission levels, power factor, and load types. The prediction models examined included General Linear Regression, SVM with Radial Basis Kernel, K Nearest Neighbor, Random Forest, and Regression Trees. The models were assessed using Mean Absolute Error, Root Mean Squared Error, Mean Absolute Percentage Error, and Coefficient of Variation before the highest performing model was chosen. The findings showed that the Random Forest model was the most effective in predicting energy consumption and outperformed the other algorithms. The study highlights the importance of energy consumption prediction for energy production companies and the need to reduce energy usage for the benefit of the environment, economy, and domestic consumers. It specifically focuses on the fast-growing manufacturing industry in South Korea and the increasing electricity consumption in the industrial sector. The authors aim to provide a data-mining-based solution to manage energy consumption within the industry and advise further research in this area.
Zhao [17] proposes a data-mining-based prediction method for energy saving in public buildings. The research starts by analyzing a large amount of data generated by the energy consumption monitoring system of public buildings to extract valuable and relevant information. The study focuses on finding the distribution law of regional energy consumption influencing factors and energy consumption parameters, and proposes a stochastic model for predicting energy-saving transformation. To do this, the author uses the idea of reverse modeling to establish a stochastic prediction model for energy-saving transformation. In this model, a genetic algorithm is used to optimize the sub-tree generation process of the gradient boosting decision tree and improve the short-term prediction accuracy of the C4.5 decision tree. The results demonstrate that this model has a greater prediction accuracy than conventional regression models and has some generalizability to energy consumption data from various building equipment. The author concludes that the model is promising for predicting energy consumption and improving energy efficiency in public buildings.
Ali et al. [18] discuss how the Smart Grid enhances the architecture of the electrical grid by establishing a more robust communication mechanism between consumer and supplier. Data on customer electrical load profiles are now more readily available because of the use of smart meters. The researchers consider the analysis of electricity use data a key problem in improving and efficiently developing this new power system, and they therefore chose data mining techniques as the analytical procedure for analyzing energy use. They employed data mining approaches such as frequent pattern mining and association rule mining, clustering, classification and characterization, and outlier detection, in addition to exploratory data analysis and preprocessing. They described and assessed the strategies that are most effective for understanding data from electrical load profiles. For instance, clustering helped them identify which household appliances in the dataset consume more energy than others, which would benefit both peak reduction and peak analysis. However, the clustering prediction result on their datasets was not very strong, with 63.28% accuracy. They believe that the results can be enhanced in the future by using the methods in real time or with more granular data such as seasonal and user sociodemographic data.
Rathod and Garg [19] apply data mining methods to geographical features such as rivers, farms, ground, and highways to derive information about power usage in relation to atmospheric temperature and physical distance. These methods are employed to identify regional power consumption patterns in a city. Temperature and consumer groups are categorized according to how much electricity they consume using the K-means clustering approach. Using association rule analysis, rules on electricity consumption are developed to describe the effects of the physical distance between natural geographic objects and various places. The approach involves pre-processing the data, applying data mining algorithms, and interpreting the discovered knowledge. Real datasets of over 20,000 consumers in Sangli city are used to validate the proposed work.
Ren et al. [20] used data mining techniques to study the efficiency of the electric heating systems of 62 apartments in an average-rent housing complex. The results showed that data mining techniques are a valuable tool for analyzing large sets of electricity performance data, and they revealed usage patterns of resident comfort and heating operation numerically. The clustering analysis identified six distinct patterns of room temperature demand, with most households falling into three of these patterns. The decision tree algorithm linked comfort demand class and heating system performance with heating electricity consumption, providing a valuable tool for designing heating systems in average-rent housing. The room temperature patterns can be used to improve thermostat settings and provide more accurate electricity simulations. This research shows the advantages of using data mining techniques to gain in-depth insight into occupant behavior and increase electricity efficiency in low-income housing.
Song et al. [21] studied the accurate prediction of electricity consumption, which they found to be critical for electricity efficiency in buildings. Despite improvements, two major challenges remain in predicting electricity use based on occupancy data: a lack of consideration for variation in building occupancy, and a weak correlation between occupancy and electricity consumption. To address these challenges, the research examined the impact of occupancy characteristics on electricity prediction performance. Comparative experiments using data-mining-based models led to two key findings. First, occupancy characteristics significantly affect prediction accuracy. The GL-2 model was found to have the highest accuracy but required longer network training. Second, the proposed model provided acceptable accuracy with minimal historical data, with all results within the acceptable tolerance range. The research advances the understanding of the impact of occupancy on electricity prediction and provides a practical solution with minimal data requirements.
Liu et al. [22] proposed a data-mining-based framework for extracting typical electrical load patterns (TELPs) and gleaning useful information from the patterns for specific buildings. Data preparation, TELP identification, and knowledge discovery are the three stages of the framework. A two-step clustering analysis approach is used to find TELPs. This method reduces the dimensions of daily energy consumption data, finds outliers, and combines patterns with similar characteristics to find TELPs. A comparison of the method's performance with two single-step clustering algorithms demonstrated its efficacy. The findings indicated that day type (working or non-working day) and outside air temperature were the primary characteristics that separated TELPs. The framework offers a broad and systematic methodology for examining patterns of power use and could identify unusual electrical load profiles early on. Limitations, however, include the need for a more data-adaptive feature extraction approach, further framework validation with a bigger dataset, and developing subsystems.
Finally, Stjelja et al. [23] studied how building occupancy levels can be ascertained using sub-metered power and water use together with machine learning techniques. A supervised data mining technique using Random Forest and an unsupervised data mining technique using k-means clustering were both tested. The findings demonstrated that, with the supervised technique and all available predictors, it is possible to forecast the number of people in an office from sub-metered water and power use. The training dataset should ideally be at least one to three months long. The unsupervised technique demonstrated that, using sub-metered office equipment power use, it is feasible to cluster the days into three occupancy levels (high, medium, and low) without the need for ground truth. The study found that utilizing lighting, power, and water use to gauge occupancy was not as effective as the unsupervised technique. Musleh and AlMetric proposed ensemble machine learning approaches to predict mid-term electricity consumption on a local dataset in Saudi Arabia [24,25].
Based on this literature review, the following can be concluded:
1. Data mining in the energy sector is among the most active areas of research and is promising in various ways.
2. Various algorithms have been investigated and, in many cases, Random Forest was found to perform better due to its ensemble nature, where the decision is made by a collection of trees.
3. Most of these studies utilized categorical or class-based data, where each instance was labeled with a specific class.
4. For that reason, the task was mainly treated as a classification problem.
5. The preprocessing techniques used for such datasets are largely the same across studies.
However, in the current study, the dataset was not labeled. That is why we have investigated clustering approaches rather than classification. In the future, classification techniques can be employed on a labeled dataset.

Dataset description
The World Energy Consumption dataset in study [26], available on the "Our World in Data" website, is a comprehensive and open dataset that provides a wealth of information on energy consumption patterns around the world.
The dataset includes data on primary energy consumption, per capita energy consumption, and growth rates for various countries, as well as the energy mix and electricity mix for each country. With a temporal range of 1900 to 2021, this dataset offers a historical perspective on energy consumption trends across the globe. Additionally, the dataset includes a variety of other relevant metrics, such as data on energy efficiency and conservation measures, economic and population growth, and carbon emissions. With over 22,500 records, this dataset is an invaluable resource for researchers, policymakers, and anyone interested in understanding the global energy landscape. The description includes the attribute number and name, minimum and maximum values, measurement unit, and number of distinct values (DVs) in the dataset. After preprocessing, there are no missing values for any attribute, and all attributes belong to the nominal data type. Yearly consumption is expressed in kilowatt-hours (kWh), megawatt-hours (MWh), and terawatt-hours (TWh). Table 1 shows the dataset description, which includes the attribute information and details such as min, max, and unit.

Data preprocessing
Prior to utilizing the World Energy Consumption dataset, we employed a thorough pre-processing procedure to ensure the integrity and accuracy of the data [27][28][29][30][31][32]. Specifically, we handled missing values by replacing them with an appropriate constant value. We also removed any countries or state federations that did not contain any data, to eliminate potential errors, discrepancies, and inconsistencies in the dataset. Consequently, the dataset was cleaned and ready to be investigated by the data mining algorithms.
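The two cleaning steps described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the field names (`country`, `primary_energy`, `electricity`) are hypothetical, and the fill constant of 0.0 is an assumption, since the constant used in the study is not stated.

```python
def preprocess(rows, value_fields, fill_value=0.0):
    """Drop entities with no data at all, then fill remaining gaps with a constant.

    rows         -- list of dicts, one per (country, year) record
    value_fields -- names of the numeric attributes to check
    fill_value   -- constant used for missing values (assumed, not from the study)
    """
    # Step 1: find countries that have at least one non-missing value.
    countries_with_data = {
        row["country"]
        for row in rows
        if any(row.get(field) is not None for field in value_fields)
    }

    cleaned = []
    for row in rows:
        if row["country"] not in countries_with_data:
            continue  # entity contains no data at all: remove it
        new_row = dict(row)  # copy so the input is left untouched
        # Step 2: replace each remaining missing value with the constant.
        for field in value_fields:
            if new_row.get(field) is None:
                new_row[field] = fill_value
        cleaned.append(new_row)
    return cleaned

# Toy records (hypothetical values, for illustration only).
rows = [
    {"country": "A", "year": 2000, "primary_energy": 10.0, "electricity": None},
    {"country": "A", "year": 2001, "primary_energy": None, "electricity": 3.0},
    {"country": "B", "year": 2000, "primary_energy": None, "electricity": None},
]
cleaned = preprocess(rows, ["primary_energy", "electricity"])
```

In this toy input, country "B" has no data in any field and is dropped, while the gaps in country "A" are filled with the constant.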

Statistical analysis
This section provides the statistical analysis of the dataset, which will eventually be helpful during the modelling, design, and analysis phases [33][34][35][36][37]. Table 2 in the Appendix presents a complete summary of the statistical analysis of the dataset used in the study.
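A per-attribute summary of this kind can be computed with a few lines of code. The sketch below is illustrative only: the choice of statistics (minimum, maximum, mean, population standard deviation, number of distinct values) is an assumption about what such a table typically contains, and the input values are toy numbers.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for one numeric attribute."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
        "distinct": len(set(values)),        # number of distinct values (DVs)
    }

# Toy attribute column (hypothetical values, for illustration only).
summary = summarize([1.0, 2.0, 2.0, 5.0])
```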

Graphical representation of data
This section contains the graphical representation of various aspects of the data. Visual representation plays a pivotal role in the data mining and knowledge discovery process [38][39][40][41][42]. Data visualization techniques are used to illustrate these aspects. Figure 1 presents energy consumption over the period from 2000 to 2016. It shows that energy consumption has increased significantly over the years. Figure 2 and Figure 3 present the top five countries in the world by energy consumption and energy generation. China, the USA, India, Russia, and Japan are at the top of the list. The top five countries are shown because they account for a significant part of overall energy consumption around the globe, and are therefore crucial to understanding the overall impact on the global energy sector. It is also evident that these countries are either tech giants, densely populated, or both, which reflects the major factors behind energy consumption. These trends suggest that demand will continue to increase significantly in the future.
Figure 4 presents a yearly consumption comparison among various sources such as coal, oil, and primary energy. Primary energy consumption was the highest, followed by coal and oil, respectively. Likewise, Figure 5 presents a comparison among electricity production means. Coal-based electricity production was higher than gas-based and oil-based production, respectively.

Data mining models
Data mining models are algorithms and mathematical approaches used to extract information from data, and they play a crucial role in the field of data mining. These models can be used for a variety of tasks, including classification, clustering, association rule mining, and anomaly detection. Popular data mining models include simple K-Means and Expectation Maximization (EM). Each model has its own strengths and weaknesses and is suited to different types of data and applications. Choosing a data mining model often depends on the specific problem that needs to be solved, the characteristics of the data, and the desired outcome.

Simple K-Means
Simple K-Means is a straightforward and extensively used clustering algorithm based on partitioning. It divides n data items into k groups, where k is the number of clusters defined by the user. Each item is placed in the cluster whose mean (centroid) is closest to it. The K-Means method uses the Euclidean distance to measure the distance between an object and a centroid. The process of assigning items and recomputing centroids is repeated until there is no further change [43][44][45][46]. Figure 6 presents the clustering plot obtained using the simple K-Means algorithm. There are 10 clusters, numbered 0 to 9, as given in the figure's legend. The plot depicts higher energy consumption for the countries in the green (cluster 2), red (cluster 1), yellow (cluster 6), and blue (cluster 3) zones, while the others are on the relatively lower side. The diagram further demonstrates the energy consumption patterns on which the clusters are formed. Thicker regions represent higher energy consumption rates, while thinner regions represent lower ones.
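The assign-and-update loop described above can be sketched in a few lines. This is a minimal illustration of the standard K-Means procedure, not the implementation used to produce Figure 6; the function names, the optional `init` parameter, and the toy points are assumptions made for the example.

```python
import math
import random

def assign_labels(points, centroids):
    """Assign each point to the index of its nearest centroid (Euclidean distance)."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Partition points (tuples of floats) into k clusters.

    init -- optional list of k starting centroids; if omitted, k points are
            sampled at random (results then depend on the seed).
    """
    centroids = list(init) if init is not None else random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        labels = assign_labels(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # New centroid is the coordinate-wise mean of the cluster members.
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # no net change left: converged
        centroids = new_centroids
    return assign_labels(points, centroids), centroids

# Toy two-dimensional points forming two well-separated groups (hypothetical values).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (10.0, 10.0), (10.5, 9.5), (9.8, 10.2)]
labels, centroids = kmeans(points, k=2, init=[(1.0, 1.0), (10.0, 10.0)])
```

The starting centroids are passed explicitly here because K-Means only finds a local optimum: with an unlucky random start, even clearly separated data can end in a poor partition, which is why implementations commonly run several restarts.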

Expectation Maximization (EM)
The Expectation Maximization (EM) algorithm is reported to be more efficient than other algorithms in terms of data loss. It has been shown to converge to the most likely estimate of the original distribution based on altered data. When large amounts of data are available, the EM algorithm provides a good estimate of the original distribution [47][48][49][50]. Figure 7 shows the clustering output of the EM algorithm. The figure shows three clusters as the outcome of the algorithm, namely clusters 0, 1, and 2, represented by blue, red, and green, respectively. Cluster 1 represents the highest energy consumption, followed by cluster 0 and finally cluster 2.
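To make the mechanics of EM-based clustering concrete, the sketch below fits a one-dimensional Gaussian mixture by alternating an expectation step (computing each component's responsibility for each point) and a maximization step (re-estimating weights, means, and variances). It is a simplified illustration, not the implementation behind Figure 7; the one-dimensional setting, the starting parameters, the variance floor, and the toy data are all assumptions.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, init_means, init_vars, init_weights, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM; returns parameters and responsibilities."""
    means, variances, weights = list(init_means), list(init_vars), list(init_weights)
    k = len(means)
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
    return means, variances, weights, resp

# Toy consumption values forming a low group and a high group (hypothetical).
data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.5, 10.0, 10.5]
means, variances, weights, resp = em_gmm_1d(
    data, init_means=[2.0, 8.0], init_vars=[1.0, 1.0], init_weights=[0.5, 0.5]
)
# Hard cluster assignment: the component with the largest responsibility.
clusters = [max(range(2), key=lambda j: r[j]) for r in resp]
```

Unlike K-Means, which assigns each item to exactly one cluster, EM keeps soft memberships (the responsibilities) and only produces hard clusters when the most probable component is taken at the end.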

Hierarchical clustering
Hierarchical clustering is a technique for grouping data that creates a hierarchy of clusters. However, the quality of its results can be limited by the inability to adjust a merge or split decision once it has been made. The quality of hierarchical clustering can be improved by combining it with other algorithms for multi-stage clustering [51][52][53][54][55][56][57][58][59]. Figure 8 presents the output of the hierarchical clustering algorithm, which results in ten clusters (0-9), where cluster 0 represents the country with the highest energy consumption.
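The irreversibility of merge decisions is easiest to see in a bottom-up (agglomerative) sketch. The code below is illustrative only and is not the configuration used for Figure 8: the choice of single linkage, the function name, and the toy points are assumptions.

```python
import math

def agglomerative(points, n_clusters):
    """Bottom-up hierarchical clustering with single linkage.

    Starts with every point in its own cluster and repeatedly merges the two
    closest clusters until n_clusters remain. Returns lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # The merge is final: later steps never split this cluster again.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Toy two-dimensional points forming two well-separated groups (hypothetical values).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (10.0, 10.0), (10.5, 9.5), (9.8, 10.2)]
result = agglomerative(points, n_clusters=2)
```

Because each merge is greedy and permanent, an early poor merge propagates through the rest of the hierarchy, which is the limitation noted above and the motivation for multi-stage approaches.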

CONCLUSIONS
In conclusion, the application of data mining techniques demonstrates how it is possible to make informed judgments and drive positive change in the energy sector. The findings of this research highlight the significance of data-driven decision-making, which can help build a more effective energy policy and meet the challenges of the future. The world's energy sector faces significant challenges in meeting rising demand, reducing costs, and increasing energy efficiency. However, the solution to these problems may lie in the vast volumes of data available. By using data mining techniques, it is possible to examine global electricity consumption data and uncover valuable insights and trends. This research shows the favorable outcomes obtained from the simple K-Means and EM algorithms applied to the dataset, compared with the hierarchical clustering algorithm. These insights provide valuable information for decision-making in the energy sector, enabling better cost management and a more effective energy policy. The findings from this analysis can also serve as a platform for future research and development in the field. Data-driven decision-making is crucial for the energy sector, and continued investment in data mining techniques is necessary to confront the future challenges of the industry. The development of deep learning and hybrid approaches can offer even greater insights and further contribute to positive change [60][61][62][63][64].

Figure 6. Clustering plot of simple K-Means

Figure 7. Clustering plot of EM

Table 2. Statistical analysis of the dataset