HERF: A Machine Learning Framework for Automatic Emotion Recognition from Audio

ABSTRACT


INTRODUCTION
The human voice reflects different kinds of emotions. Recognizing such emotions is useful in many applications. In the recent past, researchers started working on automatic emotion recognition with different techniques. The emergence of machine learning (ML) has brought new possibilities for exploiting Artificial Intelligence (AI) to improve emotion recognition performance using audio samples. However, the task is challenging due to practical issues in extracting emotional content from audio samples [1]. Recognizing emotions from speech has practical applications in the real world, particularly where humans interact with machines. In the automotive field, it makes it possible to know the mental state of the driver and take safety precautions. In the healthcare domain, therapists can use it as a diagnostic tool. It is used in automatic translation systems, where recognizing the speaker's emotion plays a crucial role. It is useful in aircraft cockpits, where the system can assess the stress level of pilots. It is also used in call center applications and mobile communications. Many real-world systems adapt their responses to the recognized human emotions.
The literature shows many contributions to emotion recognition using ML models [1][2][3]. SVM is an ML technique used for emotion recognition. Feature selection and optimization using an evolutionary method are employed in the study [4]. Multi-modal data and an ensemble of ML models are used in the study [5] to improve prediction performance. Another ensemble model with bagged SVMs is explored in [6] to exploit many models. Deep learning (DL) models are also found useful in human emotion recognition [7][8][9][10]. A pre-trained deep CNN model is used in [11] along with an attention model. In the study [12], Multilayer Perceptron (MLP) and DL models are employed. A weighted approach using deep learning is explored in [1]. From the literature, it is ascertained that ML models can learn from historical data or training samples to detect emotion from audio samples automatically. However, there is a need for optimal feature selection combined with a neural network to improve performance further. Towards this end, our contributions are given below.

RELATED WORK
This section gives an overview of prior work on emotion recognition from audio. Seng et al. [2] highlighted the societal shift towards spiritual aspects and proposed an emotion communication system to address non-line-of-sight challenges, facilitating real-time multimedia emotion transmission. Livingstone et al. [13] studied speech emotion classification, covering feature selection, classification methods, and emotional speech database preparation. The discussion addressed performance limitations, emphasizing the need for improved classification accuracy and database quality. Proposed extensions included the integration of speaker-dependent and speaker-independent systems, along with temporal structure modelling and multiple classifier systems. Kumar et al. [14] introduced a video-based abnormal human activity recognition system for elderly care, ensuring privacy through binary silhouettes. Using R-transform and KDA, the system achieved high recognition accuracy. Future enhancements involve incorporating depth silhouettes for heightened discrimination. Livingstone et al. [13] proposed a novel architecture to discriminate emotional physiological signals from multichannel biosignals, achieving a recognition rate of 77.68% for four emotional states. The approach holds promise for healthcare applications, particularly in monitoring elderly or chronically ill individuals. Kumar et al. [14] used EEG and ML to categorize human emotions, attaining a good level of accuracy. The identified features suggest practical applications in non-invasive emotional assessment, with future work aiming to deepen the understanding of brain responses to music at various stages.
Bhavan et al. [15] introduced a Hindi speech recognition system utilizing the Discrete Wavelet Transform and the K-Means algorithm, where Daubechies8 with 5-level decomposition produced the best outcomes. Speaker independence was achieved through Hidden Markov Models (HMMs). Tarantino et al. [16] presented a context-sensitive multimodal emotion recognition approach, incorporating BLSTM networks for capturing context-related information. BLSTM exhibited superior performance compared to standard techniques, achieving notable discrimination in emotional space clusters. Prospective investigations may delve into dynamic modelling for low-level features and integrate linguistic information into the system. Li et al. [17] developed an SER system extracting features such as MFCC and MEDC and utilizing SVM for emotional state recognition. The system demonstrated high accuracy, indicating independence from speaker and text variations. Batziou et al. [3] explored Speech Emotion Recognition (SER), focusing on enhanced performance through various features. Feature selection using the Fast Correlation-Based Filter (FCBF) identified 25 key features, and the Fusion of Artificial Neural Networks (FAMNN) with Genetic Algorithm (GA) optimization was employed. Xu et al. [18] proposed a modified supervised manifold learning algorithm (MSLLE) for spoken emotion recognition, emphasizing improved interclass distance and generalization. Experimental results on two databases showed the superior performance of MSLLE over other methods.
Ma et al. [19] used a novel deep CNN architecture, PCNSE-SADRN-CTC, designed for discrete speech emotion recognition, demonstrating efficiency on various datasets. Future research aims to explore its applications in diverse speech-related tasks. Shah et al. [20] introduced Fusion-ConvBERT, a fusion network model for SER, showing superior performance across datasets. Imani et al. [21] proposed a novel SER method combining DCNN and BLSTMwA models, surpassing popular approaches on the EMO-DB and IEMOCAP datasets. Zhang et al. [22] introduced RAVDESS, a validated multimodal emotional database featuring 24 actors expressing diverse emotions in speech and song, freely accessible to researchers. Jiang et al. [23] discussed Chroma feature extraction using STFT and CQT, highlighting the advantages of STFT for chord recognition projects.
Koduru et al. [24] examined various ANN models for forecasting based on temporal data. The study concludes that RBF yields the best accuracy, followed by RNN and MLP, while GRNN exhibits the lowest efficiency. Emphasizing the versatility of ANN models in psychology, the research advocates effective pre-processing of time series data and proposes an enhanced performance index. Despite the flexibility demonstrated, researchers are encouraged to address limitations and explore the generalization of these models to other databases in future studies. Lee et al. [25] concentrated on voice-based emotion detection utilizing MLP and CNN classifiers, with CNN outperforming MLP in a web application. Plans involve expanding the training data, experimenting with different architectures, and extending emotion detection to video, image, and text inputs. Lim et al. [26] tackled SER challenges, introducing Head Fusion with multi-head attention, resulting in improved accuracy (76.18% WA, 76.36% UA) and robustness against added noise. Zhang et al. [27] used an audio-visual system integrating rule-based and machine learning techniques. The system employs BDPCA+LSLDA and OKL-RBF neural classifiers for visual optimization. Future refinements include optimizing window length and overlap in the audio path and exploring applications such as customer satisfaction assessment. Shah et al. [20] conducted a review of EEG-based methods, encompassing feature extraction, feature reduction, ML classifiers, and the correlation of EEG rhythms with emotions. The review compares ML and deep learning algorithms, identifying open problems and certain issues. Raghu Vamsi et al. [28] present an attention mechanism that aligns speech frames with recognized text; the proposed model achieves better results on the IEMOCAP dataset. Zhao et al. [29] contribute to speech emotion recognition by proposing a bagged ensemble with Gaussian-kernel SVMs, demonstrating superior performance across three datasets compared to state-of-the-art approaches. Xu et al. [30] introduce a multitask learning method, incorporating self-attention and gender classification as auxiliary tasks, resulting in a 7.7% improvement over existing methods on the IEMOCAP dataset. Focusing on enhancing speech emotion recognition, Lee et al. [25] utilize feature extraction methods such as MFCC, DWT, pitch, energy, and ZCR, leading to improved accuracy and efficiency in experimental results. Addressing multi-modal emotion recognition challenges, Li et al. [17] propose a deep-weighted fusion method that incorporates cross-modal noise modelling and effective feature extraction.
Li et al. [31] examine emotion recognition with eye-tracking technology; the study outlines pertinent features and challenges and acknowledges the limited existing literature in this domain. Zhang et al. [27] delve into enhancing SER by employing a novel windowing system and self-attention, showing improved performance on the IEMOCAP dataset. Introducing the Probability and Integrated Learning (PIL) algorithm, Raghu Vamsi et al. [28] tackle complex human emotion recognition, specifically addressing emotional uncertainty through classification probability. While showing promise for affective computing in videos and artificial emotion for robots, the method warrants further exploration and refinement. Zhao et al. [29] contribute a cost-effective navigation and face recognition system for the visually impaired utilizing IoT, smartphones, GPS, and ultrasonic sensors. The system achieves 90% face recognition accuracy and 95% obstacle detection; potential future work involves broader object recognition and dynamic face reactions with IoT integration, considering the current limitation of static face recognition. Recognizing the significance of learner emotions in e-learning, Xu et al. [30] explore methods such as voice recognition, facial expressions, and gestures. Their findings indicate that multimodal systems, which combine various aspects, prove more effective than single-modal approaches. However, there are certain limitations in existing methods, particularly the lack of feature engineering to enhance prediction performance.

MATERIALS AND METHODS
This section presents materials and methods in terms of dataset details, the proposed framework, algorithms and evaluation methodology.

Dataset details
RAVDESS is a widely used dataset [31] with 1440 audio samples from 24 professional actors of both genders covering different emotions. Each emotion has two levels of intensity, known as normal and strong.
As presented in Table 1, there are 8 classes of emotions, including neutral, which does not reflect any emotion.
Since neutral means the absence of emotion, there is no strong intensity level associated with this class.

The framework
We propose an ML-based framework known as the Human Emotion Recognition Framework (HERF). It takes an audio dataset as input and performs supervised learning for automatic detection of human emotions from audio. The inputs are subjected to audio normalization, followed by feature extraction and feature selection. Afterwards, a neural network-based classifier is trained, which results in a knowledge model. This knowledge model is used to predict new audio samples and recognize 8 categories of emotions. The model is saved for further usage. A web-based application is developed and deployed; it enables users to choose an audio sample for testing, and the saved model is reused to predict the class label. Normalizing audio is critical, as it adjusts the volume of an audio signal to a uniform level and enhances audibility.
As presented in Figure 3, the Multilayer Perceptron (MLP) is a kind of ANN model. It has 3 significant layers, and its functioning is reflected in Eq. (1) and Eq. (2).
$$h_1 = f(z_1) = f(W_1 \cdot x + b_1) \tag{1}$$
ANN variants need batch-based training, where X is the input vector. Eq. (3) shows how k instances are obtained from the available ones.
Afterwards, the instances are combined as in Eq. (4). Given X, y is computed as in Eq. (5),
where (k, n) is the shape of X, n denotes the number of values in each input, k denotes the number of instances, and W is a weight matrix.
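The layer computation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the configuration used in the experiments: the ReLU activation, the softmax output, the single hidden layer, and all function names and toy weights are assumptions introduced here.

```python
import math

def relu(v):
    # Element-wise ReLU; the activation choice is an assumption for this sketch.
    return [max(0.0, a) for a in v]

def dense(W, x, b):
    # Affine map W . x + b, where each row of W holds the weights of one unit.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def softmax(z):
    # Numerically stable softmax turning scores into class probabilities.
    m = max(z)
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(x, W1, b1, W2, b2):
    h1 = relu(dense(W1, x, b1))          # hidden layer: h1 = f(W1 . x + b1)
    return softmax(dense(W2, h1, b2))    # output layer: class probabilities

def mlp_forward_batch(X, W1, b1, W2, b2):
    # X has shape (k, n): k instances, each with n input values.
    return [mlp_forward(x, W1, b1, W2, b2) for x in X]

# Toy weights and a batch of k = 2 instances with n = 2 values each (hypothetical).
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]
X = [[2.0, 1.0], [0.0, 3.0]]
probs = mlp_forward_batch(X, W1, b1, W2, b2)
```

Each row of `probs` sums to 1, and the predicted class for an instance is the index of its largest probability.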

Proposed algorithms
We proposed two algorithms for realizing the framework. The Hybrid Feature Selection (HFS) algorithm is proposed to find discriminating features that contribute to class label prediction. The second algorithm, Neural Network-based Automatic Emotion Recognition (NN-AER), exploits the Multilayer Perceptron (MLP) and HFS for automatic emotion recognition. The evaluation metrics are computed as in Eq. (6) and Eq. (7).
In the F1-score computation, p denotes precision while r denotes recall. All these metrics take values in the range from 0 to 1, where 0 reflects the lowest and 1 the highest performance.
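As a hedged illustration of how such metrics are derived from a confusion matrix, the sketch below uses the standard definitions of precision, recall, F1-score and accuracy; it is assumed, not confirmed, that these match the paper's equations. The function names and the 3-class toy matrix are hypothetical.

```python
def per_class_metrics(cm):
    # cm[i][j] = number of samples with true class i predicted as class j.
    n = len(cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, but wrong
        fn = sum(cm[c]) - tp                        # true c, but missed
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        results.append((p, r, f1))
    return results

def accuracy(cm):
    # Fraction of all samples that lie on the diagonal of the matrix.
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0

# Toy 3-class confusion matrix (made-up counts, not results from the paper).
cm = [[8, 2, 0],
      [1, 7, 2],
      [1, 1, 8]]
metrics = per_class_metrics(cm)
overall = accuracy(cm)
```

For the 8 RAVDESS classes the same functions apply unchanged to an 8 x 8 matrix.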

RESULTS AND DISCUSSION
Experiments are conducted to evaluate the emotion recognition efficiency of the proposed MLP-based ML framework and its underlying algorithms. The RAVDESS dataset [1] is used for the experiments. Data is split into 70% (training) and 25% (testing). The MLP classifier is configured using the values presented in Table 2. After setting the parameters, the MLP is trained with the 70% training set and then tested with the 25% test samples, which were not seen by the model. This section presents exploratory data analysis and performance evaluation. The proposed algorithm, Neural Network-based Automatic Emotion Recognition (NN-AER), exploits the MLP classifier and the HFS algorithm to improve prediction performance. After predicting the classes of the test samples, the ground truth is used to generate a confusion matrix. Figure 9 shows the generated confusion matrix, on which the performance evaluation is based.
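A hold-out split of this kind can be sketched as follows. The stratification by emotion class, the fixed seed, and the function name are assumptions made for illustration; the text above does not state how the samples were assigned to the two sets. The test fraction is a parameter, set here to the 25% mentioned above.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    # Group sample indices by class so every emotion appears in both sets.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy label list with two classes of 8 samples each (hypothetical data).
labels = ["happy"] * 8 + ["sad"] * 8
train_idx, test_idx = stratified_split(labels)
```

The returned index lists can then be used to select the feature vectors and labels for training and testing.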

Data analysis
Data analysis with the RAVDESS dataset is provided in this section. The dataset has 8 classes of emotions in the training and test samples.
Figure 5 shows the wave plot generated for an audio sample with happy emotion.
Figure 6 shows the spectrogram generated for an audio sample with happy emotion.
Figure 7 shows the wave plot generated for an audio sample with sad emotion.
Figure 8 shows the spectrogram generated for an audio sample with sad emotion.

Experimental results
With the given 25% test data, the proposed algorithm, with its underlying MLP classifier and HFS algorithm, produces prediction results in the form of a confusion matrix covering all emotion classes.
Figure 9 shows the confusion matrix for the 8 classes. Using the prediction results, the efficiency of the model is measured in terms of accuracy, F1-score, precision and recall. Table 3 presents the performance of the proposed model for human emotion recognition from audio in terms of precision, recall and F1-score.
As presented in Figure 10, the precision of the proposed model is observed for the different classes. Precision is 88% for the neutral class, 91% for calm, 71% for happy, 70% for sad, 84% for angry, 87% for fearful, 78% for disgust and 71% for surprised. The lowest precision, 70%, is exhibited for the sad emotion, and the highest, 91%, for the calm emotion.

Performance evaluation
This section compares our model with the existing models found in Raghu Vamsi et al. [28] and Xu et al. [30].
As presented in Table 4, there are 8 classes of emotions for which the performance of each model is provided.
Figure 13 shows the performance of all models. For the neutral class, the highest performance, 88.29%, is shown by the proposed model. For the calm emotion, the highest performance is exhibited by the proposed model with 90.72%. For the happy class, the highest performance is shown by the study [13] with 73.49%. For the sad emotion, the highest performance is 81% [14]. The highest performance for the angry emotion is exhibited by the study [14] with 88%. The fearful emotion shows the highest performance with 89% [13]. For disgust, the highest performance, 78.31%, is achieved by the proposed model. For the surprised emotion, the highest performance is exhibited by the study [14].
Table 5 presents the accuracy of the different models. Higher accuracy denotes better performance.

Figure 14. Accuracy of all models
Figure 14 shows the accuracy of all models. The accuracy of the study [13] is 68.49%, while the accuracy of the study [14] is 76.36%. The highest accuracy, 81%, is exhibited by the proposed model. Therefore, as far as human emotion recognition from audio is concerned, the proposed model outperforms the existing ones.

LIMITATIONS OF THE PROPOSED FRAMEWORK
The presented framework relies on machine learning techniques for the recognition of emotions from audio content.It adopts an effective approach to feature engineering, enhancing the quality of training during the learning process.The Multilayer Perceptron (MLP) serves as the neural network model, demonstrating efficiency in classification tasks and is adept at categorizing eight distinct emotion classes.However, despite achieving a model accuracy of 81%, the highest among the evaluated models, there remains potential for further improvement in accuracy.Additionally, it's noteworthy that the study utilizes the RAVDESS dataset.To ensure the model's efficiency is generalizable, it is imperative to evaluate its performance across multiple datasets.

CONCLUSION AND FUTURE WORK
A framework based on ML is proposed for automatic recognition of human emotions from voice content. The framework is named the Human Emotion Recognition Framework (HERF). We proposed two algorithms for realizing the framework. HFS is proposed to find discriminating features. The second algorithm, Neural Network-based Automatic Emotion Recognition (NN-AER), exploits the Multilayer Perceptron (MLP) and HFS for automatic emotion recognition. RAVDESS is the dataset used for the empirical study; it supports 8 categories of emotions. HERF introduces novelty by harnessing multiple feature selection methods to enhance the training process. We also designed a web application that recognizes the emotion in a given audio sample using the saved MLP model. Experimental results revealed that the proposed algorithm NN-AER performs better than prior methods, with the highest accuracy of 81%. In future, we intend to propose a DL-based framework to improve accuracy further.

Figure 1. Proposed Human Emotion Recognition Framework (HERF)
As presented in Figure 1, the framework has provision for automatically recognizing human emotions. In this process, feature selection plays a crucial role, as it can influence the quality of training a classifier. Feature selection is the process in which different features are extracted from the audio sample and used. Three kinds of features, namely MFCC, STFT and Mel spectrogram features, are used in the empirical study. MFCC features effectively represent the human voice; they are obtained by applying a linear cosine transform to the log power spectrum on the nonlinear Mel-frequency scale. These coefficients, derived from an audio clip, offer a cepstral representation of the audio. The advantage of MFCC over the standard cepstrum lies in its use of frequency bands evenly spaced on the Mel scale, which approximates the response of the human auditory system. Another method for audio feature extraction is the Short-Time Fourier Transform (STFT), which computes Fourier transforms over short windows of the audio signal and is commonly used for pitch shifting, pitch detection, and noise reduction. Additionally, this paper employs the Mel spectrogram as a feature extraction technique, providing a visual representation of the frequency spectrum of the voice over time. The Mel spectrogram creates a 2D representation of the audio clip reflecting the amplitude. Figure 2 illustrates model training and testing associated with the HERF framework.
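As a rough illustration of how the three feature types named above could be combined into a single input vector for the classifier, the sketch below averages each feature matrix over time and concatenates the results. Averaging over frames is an assumption made here for illustration and is not stated in the text; the function names are hypothetical, and the small hand-written matrices stand in for features that would normally come from an audio analysis library.

```python
def mean_over_time(feature_matrix):
    # Each row is one coefficient or frequency bin; each column is one time frame.
    return [sum(row) / len(row) for row in feature_matrix]

def fuse_features(mfcc, stft, mel):
    # Concatenate the time-averaged features into one flat vector.
    fused = []
    for matrix in (mfcc, stft, mel):
        fused.extend(mean_over_time(matrix))
    return fused

# Placeholder matrices (made-up numbers) with 2, 1 and 3 rows over 2 frames.
mfcc = [[1.0, 3.0], [2.0, 4.0]]
stft = [[0.0, 2.0]]
mel = [[5.0, 5.0], [1.0, 2.0], [0.0, 0.0]]
vector = fuse_features(mfcc, stft, mel)
```

The resulting vector has one entry per feature row, so every audio sample maps to an input of the same length regardless of its duration.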

Figure 2. Model training and testing of the HERF framework

Figure 4 shows the confusion matrix, which helps in deriving values that show the difference between ground truth and predictions.

Figure 4. Confusion matrix
Computation of the performance metrics is based on the correct and wrong predictions of an ML model.

Figure 8. Spectrogram of sample audio with sad emotion

Figure 9. Prediction results reflecting 8 classes in the form of confusion matrix

Figure 13. Emotion recognition performance comparison
Table 5. Accuracy comparison of different emotion recognition models

1. A framework based on ML is proposed to recognize human emotions from voice content. The framework is named the Human Emotion Recognition Framework (HERF).
2. The Hybrid Feature Selection (HFS) algorithm is proposed to find features that have discriminative power. The system employs an iterative process to extract MFCC, STFT, and Mel Spectrogram features from each audio sample. These features are fused to create a feature map, which is subsequently utilized in the NN-AER algorithm.
3. A second algorithm, Neural Network-based Automatic Emotion Recognition (NN-AER), is proposed, which exploits the Multilayer Perceptron (MLP) and HFS for automatic recognition of human emotions from audio samples. During the training phase, the HFS algorithm is executed to acquire the feature map and the corresponding ground truth values used to train the MLP classifier. Once the model is trained, it is stored for subsequent use, facilitating the automatic prediction of emotion classes for a given audio sample.
4. Our framework is evaluated with a web-based interface that tests new audio samples using the knowledge model saved after training a neural network-based classifier.
The remaining sections address different aspects of the paper. Section 2 reviews the literature on human emotion recognition from audio files. Section 3 presents the materials and methods associated with the research. Section 4 shows empirical results. Section 5 delves into the proposed framework and outlines the limitations of the underlying model. Section 6 concludes our work and provides future scope.

Table 1. Emotion classes and their intensity in RAVDESS dataset

Table 2. Parameters configured with MLP classifier

Table 3. Performance of the proposed model for human emotion recognition from audio

Table 4. Performance of all emotion recognition models