Analysis and Emotion Recognition of Educational Network New Media Images Based on Deep Learning



INTRODUCTION
In today's era of rapidly developing information technology, educational network new media (ENNM) has become an important tool and resource in education, and a growing share of educational content is presented to students as images and videos [1][2][3][4]. As this visual content spreads, effectively analyzing and recognizing the information and emotional states in ENNM images has become a key issue for improving educational quality and the student learning experience [5][6][7][8]. By using deep learning to analyze the content and emotions of ENNM images, educators can better understand students' emotional reactions, which in turn supports personalized teaching.
Related research shows that deep-learning-based image content analysis and emotion recognition have achieved significant results in multiple fields. In education, accurate image content analysis can automatically annotate educational resources, giving teachers and students efficient ways to retrieve and use them [9][10][11], while emotion recognition can help teachers understand students' emotional states in real time, adjust teaching strategies promptly, and improve teaching effectiveness [12][13][14]. In-depth research on content analysis and emotion recognition of ENNM images therefore has both theoretical value and practical significance.
Although existing methods have achieved preliminary results in image content analysis and emotion recognition, they still have shortcomings. First, traditional image content annotation methods often ignore feature re-calibration when dealing with ENNM images, which leads to inaccurate annotations [15][16][17][18]. Second, most existing emotion recognition methods focus only on judging the emotion category and ignore emotion intensity and subtle emotional changes, so they cannot meet the high-precision requirements of emotion analysis in educational scenarios [19][20][21]. A more precise and fine-grained analysis method is therefore needed to improve content and emotion recognition for ENNM images.
The research content of this paper has two parts: an ENNM image content annotation method based on feature re-calibration, and an ENNM image emotion recognition method based on cyclic structural representation. For content annotation, feature re-calibration is introduced to improve the accuracy and robustness of image content annotation. For emotion recognition, a progressive cyclic loss function is constructed and combined with emotion intensity and polarity to achieve fine-grained emotion recognition. This research addresses the shortcomings of existing methods and provides educators with more accurate and detailed tools for image content and emotion analysis.

ENNM IMAGE CONTENT ANNOTATION METHOD BASED ON FEATURE RE-CALIBRATION
Much of the research on improving ENNM image content annotation has focused on optimization in the spatial dimension. This paper instead studies the relationships between feature channels in order to achieve feature re-calibration. Unlike feature re-calibration for general images, feature re-calibration for ENNM image content places more emphasis on the semantic understanding of educational content and the extraction of relevant features, so that the annotation reflects the educational information in the image more accurately. To this end, this paper proposes an ENNM image content feature re-calibration method based on SENet. It assigns different weights to different channels, enhancing important features related to educational content and suppressing irrelevant or interfering ones, which improves the accuracy and efficiency of image content annotation.
For ENNM image content annotation, the SENet-based feature re-calibration method serves the precise extraction of educational information. Its annotation process emphasizes the semantic relevance of educational content, which distinguishes it from general-purpose image feature re-calibration. Specifically, the input is first transformed into a z×g×q feature map through a series of convolution and pooling operations. The Squeeze operation D1(·) then compresses this feature map into a 1×1×z feature vector. Next, the Excitation operation D2(·, q) keeps the dimension of the feature vector unchanged but recalculates its values. Finally, the feature vector with its new values is used to weight the original z×g×q feature map, yielding the feature re-calibration result. Figure 1 shows the SENet module schematic.
Specifically, the SENet-based feature re-calibration method can be divided into three parts:
(1) Squeeze: First, global average pooling is performed on the z2×g×q feature map to obtain a feature vector of size 1×1×z2. That is, each g×q feature map along the z2 direction is compressed into a single real number, giving a sequence of z2 real numbers. Since the z2 g×q feature maps are a collection of local descriptors, the pooled descriptors are globally expressive for the entire image. In ENNM images, this step helps extract global information reflecting the overall educational scene, forming a feature vector with a global receptive field.
(2) Excitation: Next, two fully connected layers apply nonlinear transformations to the Squeeze result. As the network deepens, this operation makes the features more inclined towards a certain category, especially categories related to educational content. In this way, the weights of the feature channels are optimized so that features closely related to educational content become more prominent, enhancing the network's ability to recognize and understand educational content.
(3) Scale: Finally, the output of Excitation is used to obtain the weighted features. Multiplying the original features channel by channel with these weights completes channel-level feature re-calibration. This ensures that, in the content annotation of ENNM images, key educational information is highlighted and irrelevant or interfering information is suppressed, improving the accuracy and effectiveness of the annotation.
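The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the ReLU and sigmoid activations and the reduction ratio `r` follow the common SENet design and are assumptions here, the weights are random rather than learned, and the function name `se_recalibrate` is hypothetical.

```python
import numpy as np

def se_recalibrate(feature_map, w1, b1, w2, b2):
    """Channel-wise re-calibration of a (channels, height, width) feature map."""
    # Squeeze: global average pooling turns each g x q map into one number.
    squeezed = feature_map.mean(axis=(1, 2))              # shape (z,)
    # Excitation: two fully connected layers (assumed ReLU, then sigmoid gate).
    hidden = np.maximum(0.0, w1 @ squeezed + b1)           # shape (z // r,)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))    # shape (z,)
    # Scale: multiply every channel by its weight.
    return feature_map * weights[:, None, None], weights

# Illustrative sizes only: z channels, g x q spatial size, reduction ratio r.
rng = np.random.default_rng(0)
z, g, q, r = 8, 4, 4, 2
x = rng.normal(size=(z, g, q))
w1, b1 = rng.normal(size=(z // r, z)), np.zeros(z // r)
w2, b2 = rng.normal(size=(z, z // r)), np.zeros(z)
y, weights = se_recalibrate(x, w1, b1, w2, b2)
```

The output `y` has the same shape as the input, and each channel is the original channel multiplied by a weight between 0 and 1, which is what allows informative channels to be emphasized relative to the others.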

Figure 1. SENet module schematic
RefineDet is based on a feedforward convolutional neural network. It generates a fixed number of bounding boxes and predicts the presence and scores of objects of different categories within these boxes; the final results are then produced through non-maximum suppression. RefineDet includes two inter-connected modules and four feature fusion modules (transfer connection blocks, TCBs). The two inter-connected modules are the anchor refinement module (ARM) and the object detection module (ODM). The ARM consists of a base network with its classification layer removed and auxiliary layers added. It performs binary classification and regression, and discards refined anchors whose negative confidence score is greater than 0.99. In the annotation of ENNM images, this step removes regions that are very unlikely to contain relevant content and retains candidate educational content areas. The ODM takes the refined anchors from the ARM as input and integrates the output of the TCBs for further object detection, generating object scores and offsets. The TCBs improve overall detection performance by discarding redundant information.
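The anchor filtering step performed by the ARM can be illustrated as follows. This is a simplified sketch under stated assumptions: anchors are plain `(x1, y1, x2, y2)` tuples, `negative_scores` holds the background confidence of each anchor, and the helper name `filter_refined_anchors` is hypothetical rather than taken from any RefineDet code base.

```python
def filter_refined_anchors(anchors, negative_scores, threshold=0.99):
    """Keep only anchors whose negative (background) confidence is at most threshold."""
    return [
        anchor
        for anchor, score in zip(anchors, negative_scores)
        if score <= threshold
    ]

# Three illustrative anchors; the second is almost certainly background.
anchors = [(10, 10, 50, 50), (0, 0, 20, 20), (30, 40, 90, 120)]
negative_scores = [0.12, 0.995, 0.70]
kept = filter_refined_anchors(anchors, negative_scores)
```

Only the anchors that survive this filter would be passed on to the ODM, which reduces the number of easy negatives it has to process.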
This paper adds SE modules to the four feature paths of the ARM and ODM in RefineDet, as shown in Figure 2. Since SE modules are added to the ARM and ODM in the same way, the ARM is used as an example here, as shown in Figure 3. First, the SE module performs global average pooling on the z2×g×q feature map to obtain a 1×1×z2 feature vector, extracting globally expressive features. Then, two fully connected layers apply nonlinear transformations to the feature vector, optimizing the weights of the feature channels so that the network focuses more on features related to educational content. Finally, the optimized weights are applied to the original features through multiplicative weighting, completing channel-level feature re-calibration.

EMOTION RECOGNITION FROM ENNM IMAGES USING CYCLIC STRUCTURAL REPRESENTATION
Emotion recognition of ENNM images requires special attention to emotion features related to educational content. The RefineDet method based on feature re-calibration described above optimizes the educational content features in the images, allowing the emotion recognition process to capture emotion expressions related to the educational context more accurately. Moreover, educational content often involves complex emotion expressions, such as motivation, encouragement, and care, and accurate recognition of these emotions is crucial for content annotation and user experience. To this end, this paper proposes a cyclic-structure-based representation method for emotion recognition of ENNM images, which analyzes and models image emotions by exploiting the inherent relationships between emotion categories. The method builds on the circumplex model of affect in psychology and uses the structure and characteristics of the emotion ring as prior knowledge to systematically guide the learning of image emotion distribution. Furthermore, this paper proposes a progressive cyclic loss which, combined with the KL loss, learns the differences between the predicted and labeled emotion distributions from coarse to fine. Because this process takes the characteristics of emotions into account, it enables more refined learning of the class distribution. Figure 4 shows the schematic of the emotion recognition model for ENNM images.

Construction of the emotion circumplex
Inspired by psychological models, this paper constructs an emotion circumplex specifically for emotion recognition of ENNM images, so that image emotion distribution can be learned in a more specific and reasonable manner. The emotion circumplex provides a structured way to represent and learn emotional states, allowing the emotion recognition process to better capture emotional expression features in educational contexts. With the emotion circumplex, any emotional state in an ENNM image can be mapped to an emotion vector ru, which reflects not only the type of emotion but also its specific position and attributes on the circumplex. Let emotion polarity, emotion category, and emotion intensity be represented by ou, ϕu, and eu, respectively. These three attributes, together with two properties of the circumplex, are defined as follows.

(1) Emotion Polarity ou: Unlike general emotion recognition tasks, emotion recognition in ENNM requires special attention to emotion polarity, so that educational content conveys emotional information appropriately and promotes learning and interaction. Drawing on Mikels' eight-category emotion dataset model, this paper subdivides emotions into two polarities: positive and negative. Specifically, in educational contexts, encouragement, appreciation, satisfaction, and pleasure are classified as positive emotions, while frustration, sadness, disgust, and anger are classified as negative emotions. To align the structure of the emotion circumplex with this polarity division, this paper evenly divides the emotion circumplex into two semicircles, one representing positive emotions and the other representing negative emotions. This division reflects the emotion distribution in ENNM images and, through the clear definition of emotion polarity, helps educators better understand and use emotional data to optimize teaching content and strategies:

\[
o_u =
\begin{cases}
0, & \phi_u \in \left[0, \tfrac{1}{2}\pi\right) \cup \left[\tfrac{3}{2}\pi, 2\pi\right] \\
1, & \phi_u \in \left[\tfrac{1}{2}\pi, \tfrac{3}{2}\pi\right)
\end{cases}
\]

(2) Emotion Category ϕu: To meet the complex emotional expression needs in ENNM, the emotion circumplex also includes compound emotions, which are distributed in the gaps between basic emotion vectors. These compound emotions reflect more subtle emotional changes and interactions in educational scenarios, such as mixed emotional states that students may experience during the learning process. This detailed categorization helps educators gain a deeper understanding of students' emotional responses, adjust teaching strategies, and improve educational outcomes. To maintain the circular structure of the emotion circumplex and meet the emotion recognition needs of educational contexts, this paper defines the emotion category ϕu ∈ [0,2π] through the polar angle in polar coordinates. This representation covers not only the eight basic emotions (encouragement, appreciation, satisfaction, pleasure, frustration, sadness, disgust, and anger) but also more complex compound emotions. The polar angles of the basic emotion vectors are determined by Z, the number of emotion categories in the psychological model.

(3) Emotion Intensity eu: Emotional expressions in educational scenarios often have multi-layered complexity. For example, students may exhibit varying degrees of encouragement or frustration when facing learning tasks, and the intensity of these emotions is an important reference for educators when adjusting teaching methods and content. This paper further defines the emotion intensity of the eight basic emotion vectors to ensure sufficient subtlety in basic emotion recognition. Specifically, the radial coordinate in polar coordinates is used to represent the emotion intensity eu ∈ (0,1] of a specific emotion category ϕu. Emotion intensity makes it possible to capture and describe emotional changes in ENNM images more accurately. For example, a photo showing a student successfully completing a task may have a very high emotion intensity, close to eu=1, while a photo reflecting a student's slight confusion in learning may have a lower emotion intensity that is still greater than 0. This detailed definition of emotion intensity helps educators promptly understand students' emotional states.

(4) Similarity: In emotion recognition of ENNM images, the similarity between different emotions is also a key concept. This paper defines the similarity between different emotions as the distance between the polar angles of their corresponding emotion vectors. Specifically, image u1 and image u2 are considered to belong to the same emotion category if and only if ϕu1=ϕu2. This definition is especially important for ENNM images because emotional expressions in educational scenarios have specific contexts and meanings. For example, the confusion and slight anxiety exhibited by students in class may be very close emotionally, with a small polar angle difference, while both differ significantly from pleasure and confidence, with a large polar angle difference. Making the distance between polar angles explicit allows students' emotional expressions in different learning states to be identified and distinguished more accurately.

(5) Additivity: According to psychological theory, emotions are additive: each compound emotion can be formed by a weighted combination of basic emotions. This matches the complex and multi-layered nature of emotional expressions in educational scenarios. For example, an image may simultaneously contain a student's curiosity and anxiety, and these two emotions can be combined into a compound emotional state through weighted addition, reflecting the student's real emotional experience in a specific educational context more accurately. This paper realizes the formation of compound emotions from basic emotions through vector addition. In emotion recognition of ENNM images, additivity not only helps identify complex emotional states but also provides educators with a systematic emotion analysis tool. For example, by weighting and combining the different emotion vectors in an image, teachers can identify which images reflect students' positive emotions and which reveal negative emotions. This detailed emotion analysis can help teachers promptly adjust teaching strategies and provide targeted emotional support.
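Polarity and similarity both reduce to simple operations on polar angles, which a short sketch can make concrete. The code assumes the circumplex is split into semicircles at π/2 and 3π/2 and that the semicircle between them is labeled 1; which label corresponds to positive emotions, and the function names `polarity` and `angular_distance`, are assumptions for illustration only.

```python
import math

def polarity(phi):
    """Return 1 if the angle lies in [pi/2, 3*pi/2), otherwise 0."""
    phi = phi % (2 * math.pi)
    return 1 if math.pi / 2 <= phi < 3 * math.pi / 2 else 0

def angular_distance(phi_a, phi_b):
    """Shortest distance between two polar angles on the circle, in [0, pi]."""
    diff = abs(phi_a - phi_b) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff)

# Two angles on either side of 0 are close on the circle even though
# their raw numeric difference is almost 2*pi.
near = angular_distance(0.1, 2 * math.pi - 0.1)
far = angular_distance(0.0, math.pi)
```

Measuring distance along the circle rather than on the raw angle values is what keeps emotions that sit next to each other across the 0/2π boundary similar.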

Mapping of emotion vectors
Based on the three attributes and two properties defined above, this paper proposes a systematic method for mapping the emotion distribution of any ENNM image to a compound emotion vector on the emotion circumplex. ENNM images usually contain emotional expressions of students in different learning scenarios, such as curiosity, anxiety, and excitement. Emotion vectors are defined in the polar coordinate system, whereas vector addition is defined in the Cartesian coordinate system. The proposed algorithm therefore defines the basic emotion vectors in polar coordinates, merges the weighted basic emotion vectors in Cartesian coordinates, and finally expresses the resulting compound emotion vector in polar coordinates with its three attributes: emotion polarity, emotion category, and emotion intensity.
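The mapping can be sketched as below. Several details are assumptions made for illustration, since the text does not fix them: basic emotion k is placed at angle (2k+1)π/Z, every basic vector has unit intensity and is weighted by its share of the distribution, and the polarity split at π/2 and 3π/2 is used. The function name `compound_emotion_vector` is hypothetical.

```python
import math

def compound_emotion_vector(distribution):
    """Map an emotion distribution to (polarity, polar angle, intensity)."""
    z = len(distribution)
    # Assumed layout: basic emotion k sits at angle (2k + 1) * pi / z.
    angles = [(2 * k + 1) * math.pi / z for k in range(z)]
    # Weighted sum of the basic vectors, carried out in Cartesian coordinates.
    x = sum(w * math.cos(a) for w, a in zip(distribution, angles))
    y = sum(w * math.sin(a) for w, a in zip(distribution, angles))
    # Convert the compound vector back to polar coordinates.
    intensity = math.hypot(x, y)
    phi = math.atan2(y, x) % (2 * math.pi)
    pol = 1 if math.pi / 2 <= phi < 3 * math.pi / 2 else 0
    return pol, phi, intensity

# A distribution concentrated on the first basic emotion...
single = compound_emotion_vector([1, 0, 0, 0, 0, 0, 0, 0])
# ...and one spread evenly over all eight, whose vectors cancel out.
uniform = compound_emotion_vector([1 / 8] * 8)
```

Under these assumptions a distribution concentrated on one emotion yields an intensity near 1, while an evenly spread distribution yields an intensity near 0, so intensity behaves like a confidence in the dominant category.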

Progressive cyclic loss
The loss function for learning image emotion distribution is usually the Kullback-Leibler (KL) loss. Let fu denote the annotated emotion distribution of an ENNM image in the dataset and f̂u the predicted emotion distribution. The features extracted from the emotion images using ResNet-50 are denoted by du, the number of emotion images by V, and the number of emotion categories in the dataset by Z. The KL loss is computed from these quantities over all V images and Z categories.

In this paper, the emotions of ENNM images are not a set of unrelated category labels but are distributed in a cyclic structure based on psychological models. To exploit this prior knowledge, this paper further proposes a progressive cyclic (PC) loss that learns the emotion distribution from coarse to fine. The PC loss penalizes the differences between the annotated emotion vector ru=(ou,ϕu,eu) and the predicted emotion vector r̂u=(ôu,ϕ̂u,êu). Specifically, constraints are established progressively on the three attributes of the emotion vector: ou, ϕu, and eu. The loss first ensures, through the polarity loss, that the predicted emotion polarity of the ENNM image is consistent with its annotated emotion polarity, and then gradually refines the accuracy of the emotion category and emotion intensity. This coarse-to-fine learning process lets the model optimize step by step and adapt to the complex and subtle emotional features in ENNM images. The polarity loss is computed from the annotated emotion polarity ou and the predicted emotion polarity ôu.

In emotion recognition of ENNM images, students' emotional states not only affect their learning performance but also influence the classroom atmosphere and teachers' teaching strategies, so accurately identifying and distinguishing different emotion categories is particularly important. This paper therefore introduces a category loss, which considers not only whether the predicted and annotated categories differ but also how these differences are distributed on the emotion circumplex. It does so by measuring the difference in polar angle between the predicted emotion vector and the annotated emotion vector. This approach uses the emotion circumplex to translate the similarity of emotion categories into geometric angle distances, reflecting the relationships between different emotion categories more intuitively.

Under the category loss, the closer the polar angles of two emotion vectors, the more similar the emotional states they represent. However, considering emotion categories alone may not be sufficient. For example, two ENNM images may have the same emotion category but significantly different emotion intensities, in which case it is difficult to regard them as having the same emotional state. Therefore, the progressive cyclic loss also introduces emotion intensity to characterize emotional states in more detail. Specifically, this paper incorporates eu into the polarity loss and the category loss as the confidence level of ou and ϕu.

This paper combines the traditional KL loss with the proposed PC loss and takes the weighted sum of the two as the loss function of the entire network, with the hyperparameter ω balancing the two terms. The KL loss measures the difference between the predicted distribution and the true distribution, while the PC loss refines the representation of emotional states by introducing emotion intensity and polar angles in the polar coordinate system. In this way, the model can roughly identify emotion categories in the initial stage and make finer adjustments in subsequent stages by introducing emotion intensity and polarity. This progressive refinement supports the accuracy and sensitivity of emotion recognition, particularly in educational scenarios, where it enables better capture of subtle changes in students' emotions.
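How the two losses could be combined is sketched below for a single image. Only the KL term follows a standard definition. The polarity and category terms, the way the annotated intensity weights them, and the form KL + ω·PC are illustrative assumptions, not the paper's formulas, and all function names are hypothetical.

```python
import math

def kl_loss(target, predicted, eps=1e-12):
    """KL divergence between the annotated and predicted distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps)) for t, p in zip(target, predicted)
    )

def pc_loss(target_vec, predicted_vec):
    """Assumed progressive cyclic penalty on (polarity, angle, intensity)."""
    o_t, phi_t, e_t = target_vec
    o_p, phi_p, _ = predicted_vec
    # Coarse step: penalize a polarity mismatch.
    polarity_term = float(o_t != o_p)
    # Finer step: circular distance between polar angles, scaled to [0, 1].
    diff = abs(phi_t - phi_p) % (2 * math.pi)
    category_term = min(diff, 2 * math.pi - diff) / math.pi
    # The annotated intensity acts as a confidence weight on both terms.
    return e_t * (polarity_term + category_term)

def total_loss(target, predicted, target_vec, predicted_vec, omega=0.5):
    """Assumed weighted sum of the KL loss and the PC loss."""
    return kl_loss(target, predicted) + omega * pc_loss(target_vec, predicted_vec)

target = [0.7, 0.2, 0.1]
perfect = total_loss(target, target, (0, 0.4, 0.9), (0, 0.4, 0.9))
wrong = total_loss(target, [0.1, 0.2, 0.7], (0, 0.4, 0.9), (1, 3.0, 0.5))
```

With this form, a prediction that matches the annotation incurs no loss, and a prediction in the wrong semicircle is penalized both for its polarity and for its angular distance, more strongly when the annotated intensity is high.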

EXPERIMENTAL RESULTS AND ANALYSIS
Tables 1 and 2 present the comparison results of distribution metrics for content annotation of ENNM images on the training set. Our method shows clear advantages on all metrics. In Table 1, our method achieves the lowest Euclidean distance, 0.22, compared with 0.43 for the Bayes Classifier and 0.54 for SVM. In Manhattan distance and Minkowski distance, our method achieves 0.78 and 0.67, respectively, outperforming the other methods. In Bhattacharyya distance and Hellinger distance, our method ranks first with values of 0.42 and 0.88. Our method also performs best in Intersection similarity and comprehensive ranking (Rank), with values of 0.72 and 1.2, respectively. In terms of accuracy (Acc.), our method achieves 0.71, higher than all other methods.

The results in Tables 3 and 4 show that our method also outperforms the other methods on multiple metrics on the test set. In Table 3, our method achieves the best values of 0.22, 0.83, and 0.77 in Euclidean distance, Manhattan distance, and Minkowski distance, respectively, showing high precision and robustness. In Bhattacharyya distance and Intersection similarity, our method achieves the best values of 0.42 and 0.72. In comprehensive ranking (Rank) and accuracy (Acc.), our method ranks first with values of 1.4 and 0.77, respectively. In contrast, the Bayes Classifier, SVM, and other traditional machine learning methods perform relatively poorly in most metrics and do not achieve the best value in any of them, which demonstrates the limitations of traditional methods in new media image content annotation.
In Table 4, our method achieves the best values of 0.22, 0.83, and 0.77 in Euclidean distance, Manhattan distance, and Minkowski distance, respectively, showing high precision and robustness. In Bhattacharyya distance and Intersection similarity, our method achieves the best values of 0.42 and 0.72. In comprehensive ranking (Rank) and accuracy (Acc.), our method ranks first with values of 1.4 and 0.77, respectively. In contrast, the CNN-based DCNN methods and the LSTM methods perform relatively poorly in most metrics and do not achieve the best value in any of them, which demonstrates the limitations of these methods in new media image content annotation.
The results in Table 5 show that our proposed loss function outperforms the other combinations in the ablation study on the training set. In Euclidean distance, Manhattan distance, and Minkowski distance, our loss function performs similarly to the other combinations, staying at low levels and demonstrating high accuracy and stability. In Bhattacharyya distance and Hellinger distance, however, our loss function achieves the best values of 0.402 and 0.887, with the largest margin in Bhattacharyya distance. In Intersection similarity and accuracy (Acc.), our loss function also achieves high values of 0.702 and 0.723, demonstrating strong overall performance.

The results in Table 6 show that our proposed loss function generally outperforms the other combinations in the ablation study on the test set. Our loss function performs best in Euclidean distance and Minkowski distance, with values of 0.223 and 0.765, respectively, showing high accuracy and robustness. In Manhattan distance, our loss function achieves 0.835, slightly worse than the 0.832 of KL Loss + Polarity Loss + Category Loss, but still a good result. In Bhattacharyya distance, our loss function achieves the lowest value, 0.435. In Hellinger distance and Intersection similarity, our loss function achieves good results of 0.879 and 0.721, respectively. Finally, in accuracy (Acc.), our loss function achieves 0.789, second only to the 0.798 of KL Loss + Polarity Loss + Category Loss.
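For readers who want to reproduce this kind of comparison, the distribution metrics named in the tables can be computed as in the sketch below. The definitions used are the common textbook ones and are assumptions here, since the exact variants and the Minkowski order used in the experiments are not stated. The function names are hypothetical, and the Bhattacharyya distance assumes the two distributions overlap.

```python
import math

def minkowski(p, q, order):
    """Minkowski distance of the given order between two distributions."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1.0 / order)

def distribution_metrics(p, q):
    """Common distances and similarities between two emotion distributions."""
    # Bhattacharyya coefficient: 1 for identical distributions, 0 if disjoint.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return {
        "euclidean": minkowski(p, q, 2),
        "manhattan": minkowski(p, q, 1),
        "bhattacharyya": -math.log(bc),
        "hellinger": math.sqrt(max(0.0, 1.0 - bc)),
        "intersection": sum(min(a, b) for a, b in zip(p, q)),
    }

same = distribution_metrics([0.5, 0.5], [0.5, 0.5])
different = distribution_metrics([1.0, 0.0], [0.5, 0.5])
```

For the distance metrics lower values indicate a closer match between predicted and annotated distributions, while for intersection similarity higher values are better.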
The comparison of annotated and predicted values for samples from the four datasets in Figure 5 shows that the proposed method achieves high accuracy and robustness in emotion distribution prediction. In the emotion label dataset, the annotated and predicted values are close in the main emotion categories (2 and 3); in particular, the predicted value of category 2 is 0.58, close to the annotated value of 0.64. In the educational content dataset, the predicted value of category 3 is 0.66; although this is lower than the annotated value of 0.8, the overall trend matches. In the multimodal dataset, the predicted value of category 3 is 0.42, close to the annotated value of 0.39, and the predicted value of category 4 is 0.28, not far from the annotated value of 0.24. Finally, in the ENNM dataset, the predicted value of category 3 is 0.6, close to the annotated value of 0.62, showing good overall performance.
The comparative analysis of the four datasets shows that the proposed method is effective in both emotion recognition and content annotation. With feature re-calibration and the progressive cyclic loss function, our method accurately reflects the distribution of annotated values in the main emotion categories across different datasets. This indicates that the method not only performs well on a single dataset but also shows strong adaptability and stability across diverse datasets, further verifying its practicality and reliability in content annotation and emotion recognition of ENNM images.

This research provides new technical means and methods for content annotation and emotion recognition of ENNM images, with important theoretical value and practical application prospects. Feature re-calibration enhances the model's ability to capture image details and complex features, while cyclic structural representation, combined with emotion intensity and polarity, achieves more precise recognition of emotional states. However, the method still has limitations: prediction accuracy for certain specific emotion categories needs to be improved, and model performance on multimodal data needs further optimization. Future research can focus on three aspects: first, further improving feature re-calibration and cyclic structural representation to enhance the recognition accuracy of complex emotional states; second, exploring multimodal data fusion to improve the model's adaptability and generalization across different datasets; and third, applying the proposed method to more practical scenarios to verify its effectiveness in different fields, while continuously optimizing the model to meet practical needs. With continued research, the proposed method is expected to achieve further progress in content annotation and emotion recognition of ENNM images.

Figure 2. Schematic of added SE modules in ARM and ODM
Figure 3. SE module added to the ARM
Figure 4. Schematic of the emotion recognition model for ENNM images

Figure 5. Visualization analysis of content annotation and emotion distribution of ENNM images

Table 1. Comparison-1 of distribution metrics for content annotation of ENNM images on the training set

Table 2. Comparison-2 of distribution metrics for content annotation of ENNM images on the training set

Table 3. Comparison-1 of distribution metrics for content annotation of ENNM images on the test set

Table 4. Comparison-2 of distribution metrics for content annotation of ENNM images on the test set

Table 5. Ablation study of the proposed loss function on the training set

Table 6. Ablation study of the proposed loss function on the test set