A Multimodal Image Feature Fusion Approach for Crop Health Assessment and Agricultural Yield Prediction

Jun Song*, Cong Zhong

School of Management, Harbin University of Commerce, Harbin 150028, China

Corresponding Author Email: 103350@hrbcu.edu.cn

Page: 249-263 | DOI: https://doi.org/10.18280/ts.430118

Received: 30 September 2025 | Revised: 3 December 2025 | Accepted: 27 December 2025 | Available online: 28 February 2026

© 2026 The authors. This article is published by IIETA and is licensed under the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).

Abstract: 

Multimodal remote sensing imagery plays a crucial role in crop health assessment and yield prediction, yet it faces challenges such as nonlinear misalignment of heterogeneous features, insufficient dynamic adaptation of modality contributions, and inadequate modeling of the spatiotemporal dynamics that link crop growth to yield. Existing methods often struggle to balance recognition accuracy, prediction reliability, and computational efficiency, limiting their applicability to high-resolution, interpretable precision agriculture monitoring. To address these challenges, this study proposes an end-to-end framework that integrates multimodal adaptive fusion with a spatio-temporal graph convolutional network (ST-GCN), enabling unified modeling of feature extraction, fusion, crop health assessment, and yield prediction. A cross-modal adaptive fusion module is designed that combines intra-modal self-attention, inter-modal cross-attention, and adaptive gating mechanisms to achieve dynamic weighting of heterogeneous features and suppress noise. Furthermore, a spatiotemporal graph convolution-based yield prediction model is developed that deeply integrates the spatiotemporal distribution of crop health features with environmental factors, enhancing both interpretability and accuracy. In addition, a multi-task joint optimization strategy coupled with self-supervised contrastive pretraining is introduced to improve model generalization under limited-sample conditions. Experimental validation on both publicly available and self-collected multimodal crop datasets demonstrates that the proposed method significantly outperforms state-of-the-art approaches on key metrics such as health status segmentation accuracy and yield prediction error, confirming the effectiveness of the proposed designs.
This research provides technical support for the deep integration of image processing technologies with precision agriculture, enriches strategies for multimodal feature fusion and spatiotemporal modeling, and holds significant academic and practical value.

Keywords: 

multimodal image fusion, crop health assessment, yield prediction, spatio-temporal graph convolutional network, attention mechanism

1. Introduction

The rapid development of multimodal remote sensing technology [1, 2] provides efficient, non-destructive technical means for crop health monitoring and yield prediction. Image processing technology, as the core support for mining multi-source remote sensing information [3-5], enables precise analysis of crop phenotypic characteristics, physiological states, and growth dynamics, and is of great significance for promoting precision agriculture and ensuring food security. At present, the application of multimodal remote sensing images in crop monitoring still faces three core challenges: the fusion of heterogeneous modal features is nonlinear and dynamic, making efficient complementarity among modalities difficult to achieve [6, 7]; the spatiotemporal association between crop health status and yield is insufficiently modeled, so the influence of crop growth dynamics on yield formation is hard to capture accurately [8]; and model generalization in small-sample scenarios is limited, hindering adaptation to complex and variable agricultural production environments [9]. These problems seriously restrict improvements in monitoring accuracy and prediction reliability.

Existing studies have made progress in multimodal image fusion, crop health status recognition, and yield prediction, but obvious limitations remain. Most multimodal fusion methods [10, 11] adopt fixed weights or simple linear fusion strategies; lacking adaptive capability for heterogeneous modal features, they struggle to bridge the semantic gap and suppress noise interference between modalities. In spatiotemporal modeling, the dynamic evolution of crop health status is insufficiently considered [12, 13], so spatiotemporal patterns such as disease spread and growth trend changes are hard to describe accurately. Moreover, most methods process feature fusion, health recognition, and yield prediction step by step [14, 15] rather than in an integrated end-to-end framework, causing information loss during transmission and making it difficult to balance processing efficiency and prediction accuracy. These shortcomings constitute the core research gap addressed in this paper.

The research objective of this paper is to propose an adaptive fusion and spatiotemporal modeling method for multimodal crop image processing, to achieve high-precision crop health status recognition and reliable yield prediction, while considering computational efficiency and model interpretability, and to provide technical support for precision agriculture monitoring. The specific contributions of this paper are as follows:

(1) A modality-specific feature extraction scheme is proposed. For the image characteristics of RGB, multispectral, and thermal infrared modalities, customized backbone networks are designed respectively, breaking through the limitation of using a single backbone network to adapt to multiple modalities, strengthening the targeted capture of crop phenotypic features, spectral information, and temperature distribution, and improving the effectiveness and hierarchy of feature representation.

(2) A multi-scale cross-modal adaptive fusion mechanism is designed, which deeply integrates intra-modal self-attention, inter-modal cross-attention, and adaptive gating units, achieving dynamic alignment and noise suppression of heterogeneous modal features, effectively solving the semantic gap problem in multimodal fusion, and forming a multi-scale enhanced fused feature representation.

(3) A spatiotemporal graph convolution network-driven yield prediction module is constructed, which transforms pixel-level crop health status features into a spatiotemporal graph structure, and combines environmental factors to realize joint modeling of crop growth spatiotemporal dynamics and yield association, significantly improving the accuracy and interpretability of yield prediction.

(4) A multi-task joint optimization and self-supervised contrastive pretraining strategy is proposed. Through multi-task collaborative optimization, the training objectives of health recognition and yield prediction are balanced; self-supervised pretraining is used to learn robust modality-invariant features, effectively solving the problem of insufficient multimodal feature learning under small-sample scenarios, and enhancing the generalization ability of the model.

The remaining sections of this paper are arranged as follows: Section 2 describes in detail the end-to-end algorithm framework proposed in this paper, including the technical details and training strategies of multimodal feature extraction, cross-modal adaptive fusion and health status recognition, and spatiotemporal yield prediction modules; Section 3 verifies the effectiveness, superiority, and stability of the proposed method through ablation, comparative, and generalization experiments, and further provides an in-depth interpretation of the results, analyzing the method’s core advantages, limitations, and potential directions for future research; Section 4 summarizes the work and contributions of this paper, and prospects the academic value and practical application prospects of the research.

2. Method

2.1 Overview of the overall framework

This paper proposes an end-to-end algorithm framework combining multimodal adaptive fusion and ST-GCN for crop health status recognition and yield prediction. The framework takes image processing technology as the core support and runs through the whole process of feature extraction, fusion, segmentation, and spatiotemporal modeling. It mainly includes three closely related core stages. In the multimodal feature extraction stage, targeted multi-scale features are extracted according to the characteristics of different modal images, providing a high-quality basis for subsequent fusion; in the cross-modal adaptive fusion and health status recognition stage, the extracted heterogeneous features are deeply fused and pixel-level health status analysis is realized, providing spatiotemporal distribution information for yield prediction; in the spatiotemporal yield prediction stage, the health status features are transformed into a spatiotemporal graph structure and combined with environmental factors to complete yield regression. The three stages form a closed-loop integrated modeling, breaking through the problems of information loss and low efficiency caused by traditional step-by-step processing, realizing end-to-end optimization from multimodal image input to yield prediction output, and taking into account processing accuracy, computational efficiency, and model interpretability.

2.2 Multimodal feature extraction

RGB images carry key phenotypic features of crops such as color, texture, and morphology, and are the core visual basis for crop health status recognition. The specificity and accuracy of feature extraction directly determine the subsequent fusion effect and recognition performance [16, 17]. In agricultural image processing, traditional backbone networks often fail to capture fine texture details and lack scale adaptability for phenotypic features. Therefore, this paper customizes and improves the Residual Network with Split-Attention Blocks (ResNeSt) and uses it as the feature extraction encoder for RGB images. The core innovation focuses on the structural optimization of the Split-Attention module: by reconstructing the module's internal channel interaction and weight allocation mechanism, the selective capture of crop phenotypic features is strengthened, providing hierarchical and discriminative RGB basic features for cross-modal fusion.

The optimization of the Split-Attention module is mainly reflected in the dual improvement of the channel grouping strategy and the attention weight calculation method. For the multi-scale characteristics of crop phenotypic features, the traditional fixed channel grouping mode is abandoned, and an adaptive grouping strategy matching the phenotypic scale of crops is adopted. According to the scale differences of crop leaf texture and canopy morphology in RGB images, the number of groups is dynamically adjusted to ensure that phenotypic features at different scales can be accurately extracted. In the attention weight calculation, a phenotypic feature correlation factor is introduced to optimize the weight allocation logic. The core calculation formula is:

$\alpha_i = \sigma\left(W \cdot \mathrm{GAP}(F_i) + b\right)$         (1)

where, $\alpha_i$ represents the attention weight of the i-th channel group, GAP(·) denotes the global average pooling operation, $F_i$ is the feature map of the i-th channel group, W is the learnable weight matrix, σ is the Sigmoid activation function, and b is the bias term. This improvement makes the attention weights more inclined toward channels with significant crop phenotypic features, effectively suppressing background noise interference and solving the problems of texture detail loss and insufficient feature discrimination of the standard ResNeSt network on agricultural images. The branch finally outputs multi-scale RGB feature maps covering shallow texture, middle-level morphology, and deep semantic features, providing high-quality support for subsequent cross-modal feature fusion.
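As a rough illustration of Eq. (1), the NumPy sketch below computes one sigmoid-gated attention weight per channel group from globally pooled descriptors. The group sizes, weight shapes, and toy inputs are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def split_attention_weights(groups, W, b):
    """Eq. (1): alpha_i = sigmoid(W . GAP(F_i) + b) for each channel group.

    groups : list of feature maps, each of shape (C, H, W)
    W      : learnable matrix, here (1, C) for a scalar weight per group
    b      : scalar bias
    """
    weights = []
    for F in groups:
        gap = F.mean(axis=(1, 2))              # GAP over spatial dims -> (C,)
        weights.append(sigmoid(W @ gap + b).item())
    return weights

# toy example: two channel groups of a 4-channel, 8x8 feature map
rng = np.random.default_rng(0)
fmap = rng.normal(size=(4, 8, 8))
groups = [fmap[:2], fmap[2:]]
W = np.ones((1, 2)) * 0.5
alphas = split_attention_weights(groups, W, b=0.0)
assert all(0.0 < a < 1.0 for a in alphas)      # sigmoid keeps weights in (0, 1)
```

In the full module these weights would rescale each channel group before the groups are re-aggregated.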

Multispectral images contain continuous band information, which can reflect physiological parameters such as chlorophyll content and water status of crop leaves, and are an important basis for crop health diagnosis. The core of feature extraction is to realize the joint capture of spectral and spatial information [18]. Traditional methods use two-dimensional convolutional networks to process multispectral images, which cannot effectively mine the correlation between adjacent bands, resulting in serious spectral information loss and insufficient feature representation capability [19, 20]. Therefore, this paper designs a customized three-dimensional convolutional neural network (3D-CNN). According to the continuity characteristics of multispectral bands, through the optimization of convolution kernel design, stride, and padding strategy, the collaborative extraction of spectral and spatial features is realized, improving the discriminability and integrity of multispectral features.

The core innovation of the customized 3D-CNN lies in the optimization of convolution kernel design and feature enhancement mechanism. A 3 × 3 × 3 three-dimensional convolution kernel is adopted, where the first two dimensions correspond to the image spatial size and the third dimension corresponds to the multispectral band dimension, ensuring that spatial neighborhood information and inter-band correlation features can be captured simultaneously. The core calculation formula of the convolution process is:

$F_{x,y,b} = \sigma\left(\sum_{i=0}^{2}\sum_{j=0}^{2}\sum_{k=0}^{2} W_{i,j,k} \cdot I_{x+i,\,y+j,\,b+k} + b\right)$        (2)

where, $F_{x,y,b}$ represents the feature value of the output feature map at position (x, y) and band b, $W_{i,j,k}$ is the weight of the 3 × 3 × 3 convolution kernel, $I_{x+i,y+j,b+k}$ is the pixel value of the input multispectral image, b is the bias term, and σ is the ReLU activation function. To reduce spectral information loss, the convolution stride strategy is optimized: the strides of both the spatial and spectral dimensions are set to 1, and "same" padding is adopted so that the spatial size and band number of the output feature map match those of the input image. At the same time, batch normalization layers are embedded to normalize the output features of each layer, reducing feature distribution shift and enhancing feature stability. This effectively solves the problem that traditional two-dimensional convolutional networks cannot capture inter-band correlations and suffer spectral information loss, and finally outputs multi-scale multispectral feature maps integrating spectral and spatial features.
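The same-padded, stride-1 operation of Eq. (2) can be sketched naively in NumPy; a real implementation would use an optimized 3D convolution, and the averaging kernel and toy cube below are illustrative assumptions.

```python
import numpy as np

def conv3d_same(I, W, bias=0.0):
    """Naive 3x3x3 convolution, stride 1, 'same' zero padding, as in Eq. (2).
    Output shape equals input shape (H, W, bands); ReLU applied at the end.
    I : input cube (H, W, bands); W : kernel of shape (3, 3, 3)."""
    H, Wd, B = I.shape
    P = np.pad(I, 1)                           # zero-pad all three dims by 1
    out = np.empty_like(I, dtype=float)
    for x in range(H):
        for y in range(Wd):
            for b in range(B):
                out[x, y, b] = np.sum(W * P[x:x+3, y:y+3, b:b+3]) + bias
    return np.maximum(out, 0.0)                # ReLU activation

cube = np.ones((4, 4, 5))                      # toy 4x4 image with 5 bands
kernel = np.full((3, 3, 3), 1.0 / 27.0)        # averaging kernel
F = conv3d_same(cube, kernel)
assert F.shape == cube.shape                   # 'same' padding preserves size
assert np.isclose(F[2, 2, 2], 1.0)             # interior voxel averages ones -> 1.0
```

Setting both spatial and spectral strides to 1, as the text describes, is what keeps the band dimension intact through the layer.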

Thermal infrared images can reflect the temperature distribution of crop canopy and are a key basis for identifying crop stress status, but they generally have problems such as low resolution and blurred temperature anomaly regions. Traditional feature extraction networks are difficult to balance computational efficiency and global temperature capture capability [21, 22]. This paper adopts a lightweight Mobile Vision Transformer (MobileViT) network as the feature extraction backbone for thermal infrared images. The core innovation lies in the simplification of the network structure and the optimization of the attention mechanism. By adjusting network parameters and attention head configuration, the collaborative improvement of global temperature distribution capture and computational efficiency is achieved.

Figure 1. Schematic diagram of modality-specific feature extraction network structure

According to the characteristics of thermal infrared images, two aspects of customized optimization are carried out on the MobileViT network: first, simplifying network parameters by replacing traditional convolution layers with depth-wise separable convolutions to reduce parameter redundancy and computational complexity; second, adjusting the number of attention heads in the Transformer encoder, abandoning the traditional fixed attention head configuration, reducing the number of attention heads from 6 to 4, and adjusting the dimension of attention weight calculation to adapt to the low-resolution characteristics of thermal infrared images. Under the premise of ensuring the ability to capture global temperature distribution, computational efficiency is further improved. The attention weight calculation adopts a simplified scaled dot-product attention mechanism, and the core formula is:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$          (3)

where, Q, K, and V represent the query, key, and value matrices respectively, and $d_k$ is the dimension of the key vectors. This optimization enables the network to quickly capture abnormal temperature regions of the crop canopy while effectively suppressing background noise interference, mitigating the low resolution of thermal infrared images and the difficulty of locating temperature anomaly regions, and outputting multi-scale thermal infrared feature maps sensitive to temperature changes. Figure 1 shows the schematic diagram of the modality-specific feature extraction network structure.
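Eq. (3) is the standard scaled dot-product attention; a minimal NumPy version, with arbitrary toy token counts and dimensions, is:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Eq. (3): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))
    return scores @ V, scores

rng = np.random.default_rng(1)
Q = rng.normal(size=(6, 8))    # 6 query tokens, d_k = 8
K = rng.normal(size=(10, 8))   # 10 key tokens
V = rng.normal(size=(10, 8))
out, scores = attention(Q, K, V)
assert out.shape == (6, 8)
assert np.allclose(scores.sum(axis=1), 1.0)  # each row is a distribution
```

Reducing the head count from 6 to 4, as described above, only changes how many independent copies of this computation run in parallel.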

The RGB, multispectral, and thermal infrared feature extraction branches respectively output multi-scale feature maps covering shallow texture, middle-level semantics, and deep abstraction. The features of each branch complement each other and have different emphases. They not only retain the core information of crop phenotype, spectrum, and temperature, but also have good hierarchical characteristics, providing high-quality basic feature support for subsequent multi-scale cross-modal adaptive fusion, and ensuring that the fusion process can fully mine the complementary value of different modalities.

2.3 Cross-modal adaptive fusion and health status recognition

2.3.1 Cross-modal adaptive fusion module

The core objective of the cross-modal adaptive fusion module is to solve the problems of nonlinear alignment, dynamic weighting, and noise suppression of heterogeneous modal features, to realize deep interaction of multi-scale semantic features, and to provide discriminative and semantically consistent fused features for subsequent health status recognition. The core innovation of this module lies in realizing intra-modal feature enhancement and inter-modal feature alignment in stages. Through customized attention mechanism design, it breaks through the limitation that traditional fusion methods are difficult to balance semantic consistency and heterogeneous information complementarity, ensuring that the core information of different modalities can be fully mined and efficiently fused. Figure 2 shows the internal structure of the multi-scale cross-modal adaptive fusion and aggregation module.

Figure 2. Internal structure of the multi-scale cross-modal adaptive fusion and aggregation module

Intra-modal self-attention enhancement is the basis for achieving effective cross-modal fusion. Its core innovation lies in designing a customized spatial self-attention mechanism for the hierarchical characteristics of multimodal features, strengthening the semantic consistency within a single modality, capturing long-range dependencies within the feature map, and improving the discriminative ability of features. Different from the traditional self-attention mechanism that treats all features equally, the intra-modal self-attention in this paper calculates attention weights on the spatial dimension for multi-scale features output by each branch, highlighting crop health-related feature regions and suppressing background noise interference. The core calculation process is as follows:

$A = \mathrm{softmax}\left(\frac{F F^{T}}{\sqrt{d}}\right)$        (4)

$F_{\mathrm{enhanced}} = A \cdot F$        (5)

where, F represents the feature map of a certain scale of a single modality, d is the channel dimension of the feature map, A is the generated spatial attention map, and $F_{\mathrm{enhanced}}$ is the feature map after self-attention enhancement. This design generates an attention weight map by calculating the correlation between each pixel and the global pixels, so that regions with significant crop health features obtain higher weights, effectively solving the problems of isolated local information and insufficient semantic coherence in single-modal features and providing semantically consistent basic features for subsequent inter-modal alignment.
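A minimal NumPy sketch of Eqs. (4)-(5), treating each pixel of a (C, H, W) feature map as a token; the map size is an illustrative assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(fmap):
    """Eqs. (4)-(5): A = softmax(F F^T / sqrt(d)); F_enhanced = A F,
    where F is the (H*W, C) matrix of per-pixel feature vectors."""
    C, H, W = fmap.shape
    F = fmap.reshape(C, H * W).T           # (pixels, channels)
    A = softmax(F @ F.T / np.sqrt(C))      # (pixels, pixels) attention map
    F_enh = A @ F
    return F_enh.T.reshape(C, H, W), A

rng = np.random.default_rng(2)
fmap = rng.normal(size=(16, 4, 4))         # 16-channel 4x4 feature map
enhanced, A = spatial_self_attention(fmap)
assert enhanced.shape == fmap.shape
assert np.allclose(A.sum(axis=1), 1.0)     # attention rows sum to one
```

Each output pixel is thus a convex combination of all pixels, which is how long-range dependencies within the map are captured.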

Inter-modal cross-attention alignment is the core for achieving efficient interaction of heterogeneous modal information. Its innovation lies in reasonably assigning the Query, Key, and Value matrices, and optimizing the dimensional configuration of multi-head attention to achieve accurate alignment and information complementarity of heterogeneous features. In this paper, the RGB branch features are used as Query, and the multispectral and thermal infrared branch features are used as Key and Value respectively to construct a multi-head cross-attention mechanism. With the visual semantic guidance of RGB features, the physiological information of multispectral and the temperature information of thermal infrared are accurately retrieved, realizing the crossing of the semantic gap of heterogeneous modalities. To adapt to the dimensional differences of multimodal features, the dimensional allocation of attention heads is optimized, and the dimension of each attention head is set to match the channel dimension of each modal feature, ensuring the effectiveness of attention calculation.

The core calculation of cross-attention adopts the optimized scaled dot-product attention mechanism, and the specific formulas are as follows:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h\right) \cdot W_o$   (6)

$\mathrm{head}_i = \mathrm{Attention}\left(Q W_{Q_i}, K W_{K_i}, V W_{V_i}\right)$        (7)

where, Q, K, and V represent the feature matrices of the RGB, multispectral, and thermal infrared branches respectively, h is the number of attention heads, $W_{Q_i}$, $W_{K_i}$, and $W_{V_i}$ are the learnable projection matrices of the i-th attention head, and $W_o$ is the output projection matrix of multi-head attention. This design enables RGB features to accurately retrieve spectral responses related to crop health from the multispectral branch and temperature signals related to stress status from the thermal infrared branch, realizing accurate alignment and deep interaction of heterogeneous modal features and providing high-quality aligned features for subsequent adaptive gated fusion.
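Eqs. (6)-(7) can be sketched in NumPy as follows; assigning multispectral features to K and thermal features to V mirrors the description above, while the head count, head dimension, and inputs are toy assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(Q, K, V, heads, Wq, Wk, Wv, Wo):
    """Eqs. (6)-(7): per-head projected attention, heads concatenated,
    then mapped through the output projection W_o."""
    outs = []
    for i in range(heads):
        q, k, v = Q @ Wq[i], K @ Wk[i], V @ Wv[i]
        d_k = k.shape[-1]
        outs.append(softmax(q @ k.T / np.sqrt(d_k)) @ v)
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(3)
n, d, h, dh = 12, 16, 4, 4                 # tokens, model dim, heads, head dim
Q = rng.normal(size=(n, d))                # RGB features act as queries
K = rng.normal(size=(n, d))                # multispectral features as keys
V = rng.normal(size=(n, d))                # thermal infrared features as values
Wq = [rng.normal(size=(d, dh)) for _ in range(h)]
Wk = [rng.normal(size=(d, dh)) for _ in range(h)]
Wv = [rng.normal(size=(d, dh)) for _ in range(h)]
Wo = rng.normal(size=(h * dh, d))
out = multi_head_cross_attention(Q, K, V, h, Wq, Wk, Wv, Wo)
assert out.shape == (n, d)
```

Matching each head's dimension to the modality's channel dimension, as the text suggests, would simply change `dh` per head.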

The adaptive gated fusion unit is the key to achieving efficient fusion of heterogeneous modal features and suppressing invalid noise. Its core innovation lies in designing a scene-adaptive weight adjustment mechanism, which can dynamically allocate the contribution of each modality according to the scene characteristics of input multimodal images, breaking through the limitation that traditional fixed-weight fusion cannot adapt to complex agricultural scenes, and at the same time achieving effective suppression of invalid modal noise and improving the discriminative ability of fused features. This unit takes the features after cross-attention alignment as input, and realizes accurate fusion of heterogeneous features through three steps: feature description, weight mapping, and dynamic fusion.

The specific implementation process of the adaptive gated fusion unit is as follows: first, global average pooling is performed on the features of each modality after cross-attention enhancement, compressing the two-dimensional feature maps into one-dimensional feature description vectors to capture the global feature information of each modality; then the feature description vectors are input into two fully connected layers for dimensional mapping and feature transformation, and the weight coefficients of each modality are obtained through Softmax normalization, ensuring that the sum of weights is 1 and realizing reasonable allocation of modal contributions. The core formulas of weight coefficients and fused features are as follows:

$v_i = \mathrm{GAP}(F_{\mathrm{cross},i})$        (8)

$w_i = \mathrm{Softmax}\left(W_2 \cdot \sigma\left(W_1 \cdot v_i + b_1\right) + b_2\right)$          (9)

$F_{\mathrm{fusion,single}} = \sum_{i=1}^{3} w_i \cdot F_{\mathrm{cross},i}$         (10)

where, $v_i$ is the feature description vector of the i-th modality, GAP(·) is the global average pooling operation, $F_{\mathrm{cross},i}$ is the feature map of the i-th modality after cross-attention enhancement, $W_1$ and $W_2$ are the learnable weight matrices of the fully connected layers, $b_1$ and $b_2$ are bias terms, σ is the ReLU activation function, $w_i$ is the weight coefficient of the i-th modality, and $F_{\mathrm{fusion,single}}$ is the fused feature at a single scale. The adaptive weight adjustment mechanism dynamically updates the modality weights according to the scene information contained in the feature description vectors. For example, when illumination is insufficient and the quality of RGB features degrades, the RGB weight is automatically reduced while the thermal infrared weight is increased, ensuring that the fused features retain core effective information while suppressing invalid noise interference.
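The three steps of the gated fusion unit (Eqs. (8)-(10)) reduce to a few lines of NumPy; the layer sizes and inputs below are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(feats, W1, b1, W2, b2):
    """Eqs. (8)-(10): GAP -> two FC layers -> softmax weights -> weighted sum.
    feats : list of 3 modality feature maps, each of shape (C, H, W)."""
    v = [F.mean(axis=(1, 2)) for F in feats]                  # Eq. (8): descriptors
    logits = np.array([W2 @ relu(W1 @ vi + b1) + b2 for vi in v]).ravel()
    w = softmax(logits)                                       # Eq. (9): one weight per modality
    fused = sum(wi * Fi for wi, Fi in zip(w, feats))          # Eq. (10)
    return fused, w

rng = np.random.default_rng(4)
feats = [rng.normal(size=(8, 4, 4)) for _ in range(3)]        # RGB / MS / TIR
C, hidden = 8, 6
W1, b1 = rng.normal(size=(hidden, C)), np.zeros(hidden)
W2, b2 = rng.normal(size=(1, hidden)), np.zeros(1)
fused, w = gated_fusion(feats, W1, b1, W2, b2)
assert fused.shape == (8, 4, 4)
assert np.isclose(w.sum(), 1.0)        # softmax guarantees weights sum to one
```

The softmax normalization is what enforces the "sum of weights is 1" property described above, so a modality whose descriptor signals poor quality is automatically down-weighted relative to the others.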

Multi-scale fusion aggregation aims to solve the problem that single-scale fusion cannot take into account both detail information and semantic information. Its core innovation lies in repeatedly executing the complete process of intra-modal self-attention enhancement, inter-modal cross-attention alignment, and adaptive gated fusion at different feature scales, and realizing efficient aggregation of multi-scale fused features through skip connections, forming a multi-scale enhanced fused feature representation, and providing sufficient feature support for subsequent high-resolution health status recognition. Multimodal features have obvious hierarchical characteristics: low-level features focus on details such as texture, middle-level features focus on semantic information, and high-level features tend to abstract features. Single-scale fusion is difficult to fully exploit the value of features at each level.

Specifically, the multi-scale fusion aggregation process performs the above fusion at three feature scales (low-level texture, middle-level semantics, and high-level abstraction), obtaining fused features at each scale. A skip connection mechanism then upsamples and concatenates features across adjacent scales, realizing complementarity and aggregation of features at different scales and finally outputting multi-scale enhanced fused feature maps. Rather than simply superimposing features, the skip connections retain the core information of each scale: the fused features contain low-level details such as crop leaf texture, cover middle-level semantics such as disease regions, and preserve the discriminative power of high-level abstract features. This effectively avoids the detail loss and semantic ambiguity of traditional fusion methods and provides a hierarchical, discriminative feature basis for subsequent pixel-level health status recognition.
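One plausible reading of the skip-connection aggregation, upsampling each coarser scale to the next finer one before channel-wise concatenation, can be sketched as follows (channel counts and resolutions are toy assumptions):

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x spatial upsampling of a (C, H, W) map."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def aggregate(low, mid, high):
    """Skip-connection aggregation sketch: bring each coarser level to the
    next finer resolution and concatenate along the channel axis."""
    mid_agg = np.concatenate([upsample2x(high), mid], axis=0)    # high -> mid scale
    low_agg = np.concatenate([upsample2x(mid_agg), low], axis=0) # mid -> low scale
    return low_agg

rng = np.random.default_rng(5)
low = rng.normal(size=(8, 16, 16))     # low-level texture scale (finest)
mid = rng.normal(size=(16, 8, 8))      # middle-level semantic scale
high = rng.normal(size=(32, 4, 4))     # high-level abstract scale (coarsest)
out = aggregate(low, mid, high)
assert out.shape == (8 + 16 + 32, 16, 16)  # all channels at finest resolution
```

Concatenation (rather than addition) is what lets each scale's channels survive intact in the aggregated representation.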

2.3.2 Health status recognition module

The core objective of the health status recognition module is to achieve high-resolution pixel-level recognition of crop health status based on the multi-scale features after cross-modal adaptive fusion, and to output a continuous health index at the same time, providing accurate spatiotemporal distribution information for subsequent yield prediction. Figure 3 shows the structure diagram of the lightweight health status recognition decoder with dual-output collaboration. The core innovation of this module lies in the customized improvement of the lightweight decoder and the dual-output collaborative design, which not only solves the problems of low segmentation accuracy, blurred boundaries, and computational redundancy of traditional decoders, but also realizes the integration of discrete health status classification and continuous health index regression, improving the comprehensiveness and practicality of recognition results and fully adapting to the actual needs of agricultural monitoring.

Figure 3. Structure diagram of the lightweight health status recognition decoder with dual-output collaboration

The design of the lightweight decoder is based on an improved Feature Pyramid Network (FPN). The core innovation lies in embedding a channel attention module and optimizing the upsampling and skip connection strategies, significantly improving segmentation accuracy and boundary clarity while ensuring computational efficiency. Different from the traditional FPN decoder that only performs simple feature concatenation and upsampling, this paper adds a customized channel attention module at each feature fusion node of the decoder. By dynamically adjusting feature channel weights, the response of key features such as disease regions and nutrient deficiency regions is enhanced, and the interference of background and invalid features is suppressed. The core calculation formulas of the channel attention module are as follows:

$M_c = \sigma\left(W_2 \cdot \mathrm{GAP}(F_{\mathrm{fusion}}) + b_2\right)$         (11)

$F_{\mathrm{att}} = M_c \cdot F_{\mathrm{fusion}}$           (12)

where, $F_{\mathrm{fusion}}$ is the multi-scale fused feature input to the decoder, GAP(·) is the global average pooling operation, $W_2$ is the learnable weight matrix, $b_2$ is the bias term, σ is the Sigmoid activation function, $M_c$ is the channel attention weight map, and $F_{\mathrm{att}}$ is the feature after attention enhancement. At the same time, the upsampling strategy combines transposed convolution and bilinear interpolation to gradually restore the spatial resolution of the feature map, avoiding the blurred boundaries caused by a single upsampling method. The skip connection mechanism is also improved: the attention-enhanced features of each decoder layer are accurately aligned and concatenated with the corresponding-scale encoder features, retaining low-level texture details and high-level semantic information. Finally, a pixel-level classification map with the same size as the input image is generated, effectively solving the problems of low segmentation accuracy and blurred boundaries in traditional decoders.
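Eqs. (11)-(12) amount to a squeeze-and-excite style channel gate; a NumPy sketch with illustrative shapes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F_fusion, W2, b2):
    """Eqs. (11)-(12): M_c = sigmoid(W2 . GAP(F) + b2); F_att = M_c * F,
    broadcasting one sigmoid weight per channel over the spatial dims."""
    gap = F_fusion.mean(axis=(1, 2))            # (C,) global channel descriptor
    M_c = sigmoid(W2 @ gap + b2)                # (C,) channel attention weights
    return M_c[:, None, None] * F_fusion        # reweighted feature map

rng = np.random.default_rng(6)
F = rng.normal(size=(8, 6, 6))                  # toy 8-channel decoder feature
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
F_att = channel_attention(F, W2, b2)
assert F_att.shape == F.shape
M = sigmoid(W2 @ F.mean(axis=(1, 2)) + b2)
assert np.all((0 < M) & (M < 1))                # each channel scaled by (0, 1)
```

Channels whose global response correlates with key regions (disease, nutrient deficiency) receive weights near 1, while background-dominated channels are attenuated.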

The dual-output design is another core innovation of this module, aiming to simultaneously realize discrete classification of crop health status and continuous health index regression. Through multi-task collaborative optimization, the comprehensiveness and reliability of recognition results are improved, providing richer spatiotemporal feature support for yield prediction. The discrete classification task targets four typical states: healthy, mild disease, severe disease, and nutrient deficiency, and performs pixel-level category division. The continuous health index regression task outputs continuous values based on Normalized Difference Vegetation Index (NDVI) transformation to quantify the crop health degree. The two tasks complement each other and are jointly optimized, avoiding the limitations of a single output mode.

To realize the collaborative optimization of dual-output tasks, a multi-task joint loss function is designed, which combines classification loss and regression loss with weights, dynamically balancing the training priorities of the two tasks. The core formula is as follows:

$L_{\text{total}} = \alpha \cdot L_{\text{seg}} + \beta \cdot L_{\text{reg}}$          (13)

where, Ltotal is the total loss function, α and β are the weight coefficients of classification loss and regression loss respectively, satisfying α + β = 1, and can be adaptively adjusted during the training process; Lseg is the classification loss, which adopts the weighted sum of cross-entropy loss and Dice loss to solve the class imbalance problem and improve segmentation accuracy; Lreg is the regression loss, which adopts mean squared error loss to ensure the accuracy of health index regression. This dual-output design enables classification results and regression results to constrain and promote each other through multi-task collaborative training, ensuring accurate classification of health status categories and quantitative evaluation of health degree. Finally, it outputs pixel-level health status segmentation maps and continuous health index maps, providing basic features containing spatiotemporal distribution and quantitative degree of crop health for subsequent spatiotemporal yield prediction.

2.4 Spatiotemporal yield prediction

The construction of the spatiotemporal graph structure is the basis for modeling the relationship between the spatiotemporal dynamics of crop health status and yield. Its core innovation lies in transforming the pixel-level health status features obtained by image processing into graph-structured data suitable for ST-GCN. This realizes the deep integration of image processing results and spatiotemporal modeling, breaks through the limitation of traditional spatiotemporal modeling being disconnected from crop health features, and accurately captures the spatial correlation and temporal evolution of crop health status, providing interpretable spatiotemporal feature support for subsequent yield prediction. The construction process relies closely on the output of the health status recognition module, transforming discrete pixel-level information into structured graph nodes and edges to ensure targeted and accurate spatiotemporal association modeling. Figure 4 shows the schematic diagram of multi-feature fusion and yield prediction.

Figure 4. Schematic diagram of multi-feature fusion and yield prediction

The construction process of the spatiotemporal graph structure is carried out around three core steps: node division, node feature extraction, and edge construction. Each step is customized according to agricultural scenarios and image processing characteristics. Node division follows actual agricultural production practice, partitioning the multi-temporal health status segmentation maps into nodes according to plots or fixed sampling areas, so that each node corresponds to an independent agricultural monitoring unit and the spatial distribution of health status is accurately quantified. Node feature extraction focuses on the core information of health status. For each node, two types of key statistics are calculated: the area proportion of each health status category and the average health index. The area proportion of the c-th health status of the k-th node is computed as $r_{k,c} = S_{k,c} / \sum_{c=1}^{C} S_{k,c}$, and the average health index as $h_k = \frac{1}{N_k}\sum_{i=1}^{N_k} h_{k,i}$, where Sk,c is the pixel area of the c-th health status in the k-th node, C is the total number of health status categories, hk,i is the health index of the i-th pixel in the k-th node, and Nk is the total number of pixels in the k-th node. Edge construction combines spatial correlation characteristics, considering both the Euclidean distance and the spatial adjacency relationship between nodes. The Euclidean distance between two nodes u and v is $d_{u,v} = \sqrt{(x_u - x_v)^2 + (y_u - y_v)^2}$, where (xu, yu) and (xv, yv) are the spatial coordinates of the two nodes. When the distance is less than a set threshold or a spatial adjacency relationship exists, an edge is constructed between the nodes and assigned a corresponding weight, accurately modeling the spatial correlation of crop health status and finally forming spatiotemporal graph structure data containing node features, spatial correlations, and time series.
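The node feature and edge construction steps above can be sketched as follows. The regular grid used for node division and the distance threshold are illustrative stand-ins for the plot-based partitioning and weighting described in the text:

```python
import numpy as np

def build_graph(seg, health, grid=4, num_classes=4, dist_thresh=1.5):
    """Build node features and edges from one time step's recognition outputs.
    seg: (H, W) int class map; health: (H, W) float health-index map.
    Nodes are grid cells, standing in for plot/sampling-area boundaries."""
    H, W = seg.shape
    gh, gw = H // grid, W // grid
    feats, coords = [], []
    for i in range(grid):
        for j in range(grid):
            cell = seg[i*gh:(i+1)*gh, j*gw:(j+1)*gw]
            hc = health[i*gh:(i+1)*gh, j*gw:(j+1)*gw]
            # r_{k,c} = S_{k,c} / sum_c S_{k,c}: per-class area proportions
            r = np.bincount(cell.ravel(), minlength=num_classes) / cell.size
            # h_k: average health index over the cell's pixels
            feats.append(np.concatenate([r, [hc.mean()]]))
            coords.append((i, j))
    coords = np.array(coords, dtype=float)
    # edge (u, v) when the Euclidean distance d_{u,v} is below the threshold
    edges = [(u, v) for u in range(len(coords)) for v in range(u + 1, len(coords))
             if np.linalg.norm(coords[u] - coords[v]) < dist_thresh]
    return np.array(feats), edges
```

Repeating this over the 12 time steps yields the node-feature sequence and (static) edge set that form the spatiotemporal graph input to the ST-GCN.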

ST-GCN modeling is the core for realizing the mining of the relationship between crop health status spatiotemporal dynamics and yield. Its core innovation lies in customizing and optimizing the traditional ST-GCN. By embedding an attention mechanism in GCN and adding a dropout layer in Gated Recurrent Unit (GRU), it solves the problems of uneven spatial information propagation and easy overfitting in traditional spatiotemporal modeling, realizes accurate capture of crop health evolution laws, and constructs a spatiotemporal feature representation with both accuracy and interpretability. This modeling process takes the constructed spatiotemporal graph structure as input, and is divided into two stages: spatial propagation and temporal capture, realizing integrated modeling of spatiotemporal dynamics of health status, and providing core spatiotemporal feature support for subsequent yield prediction.

The core innovation of the spatial information propagation stage is to embed a spatial attention mechanism in GCN, dynamically adjusting the correlation weights of neighboring nodes, solving the problem that traditional GCN treats all neighboring nodes equally and spatial information propagation is uneven, and strengthening the spatial correlation modeling of crop health status. The optimized node feature update formula of GCN is as follows:

$F_{k,t+1} = \sigma\left(\sum_{v \in N(k)} \alpha_{k,v} \cdot W \cdot F_{v,t} + b\right)$           (14)

where, Fk,t+1 is the feature vector of the k-th node after update at step t+1, N(k) is the set of neighboring nodes of the k-th node, αk,v is the attention weight between node k and neighboring node v, calculated from node feature correlation, W is the learnable weight matrix, b is the bias term, and σ is the ReLU activation function. The attention weight αk,v is obtained through Softmax normalization to ensure reasonable weight allocation, so that neighboring nodes with significant health status changes obtain higher weights, accurately capturing spatial correlation patterns such as disease spread. In the temporal dimension capture stage, the updated node feature sequence at each time step is input into the optimized GRU, and a dropout layer is added to the GRU hidden layer with a dropout probability of 0.3 to suppress model overfitting. The core update formulas are:

$r_t = \sigma\left(W_r \cdot [h_{t-1}, F_t] + b_r\right)$       (15)

$z_t = \sigma\left(W_z \cdot [h_{t-1}, F_t] + b_z\right)$     (16)

$\tilde{h}_t = \tanh\left(W_h \cdot [r_t \odot h_{t-1}, F_t] + b_h\right)$       (17)

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$         (18)

where, rt and zt are the reset gate and update gate respectively, ht−1 and ht are the hidden states at the previous time and current time respectively, $\tilde{\mathrm{h}}_t$ is the candidate hidden state, ⊙ denotes element-wise multiplication, Wr, Wz, and Wh are learnable weight matrices, and br, bz, and bh are bias terms. This optimization enables GRU to accurately capture the temporal evolution law of crop health status while avoiding overfitting, and outputs feature vectors containing spatiotemporal correlation information.
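A minimal sketch of the attention-weighted spatial propagation in Eq. (14). The concatenation-based scorer used to derive α_{k,v} from node feature correlation is an assumed form, since the paper does not give its exact parameterization:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnGCNLayer(nn.Module):
    """One spatial propagation step (Eq. 14): neighbor features are aggregated
    with softmax-normalised attention weights alpha_{k,v}."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim)    # W, b of Eq. (14)
        self.score = nn.Linear(2 * in_dim, 1)  # attention scorer (assumed form)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (K, D) node features; adj: (K, K) 0/1 adjacency with self-loops
        K = x.size(0)
        pair = torch.cat([x.unsqueeze(1).expand(K, K, -1),
                          x.unsqueeze(0).expand(K, K, -1)], dim=-1)
        e = self.score(pair).squeeze(-1)           # raw feature-correlation scores
        e = e.masked_fill(adj == 0, float('-inf')) # restrict to neighbours N(k)
        alpha = torch.softmax(e, dim=1)            # alpha_{k,v}, rows sum to 1
        return F.relu(alpha @ self.w(x))           # sigma(sum alpha * W * F + b)
```

For the temporal stage, the updated node sequences would feed a GRU implementing Eqs. (15)–(18); note that `nn.GRU`'s built-in `dropout` argument only applies between stacked layers, so the hidden-state dropout with p = 0.3 described above would be realized with an explicit `nn.Dropout` on the hidden states.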

The core innovation of multi-feature fusion prediction lies in designing a feature normalization and attention-weighted fusion mechanism to achieve efficient fusion of spatiotemporal features output by ST-GCN and environmental factors, highlighting the dominant role of spatiotemporal features while taking into account the influence of environmental factors, improving the accuracy and interpretability of yield prediction, and solving the problems of feature weight imbalance and poor interpretability in traditional fusion methods. In this process, the two types of features are first standardized to eliminate dimensional differences and ensure the effectiveness of fusion. The feature normalization formula is:

$F_{\text{norm}} = \frac{F - \mu}{\sigma}$        (19)

where, Fnorm is the normalized feature, F is the original feature, μ is the feature mean, and σ is the feature standard deviation. Then, a feature attention-weighted fusion module is constructed to dynamically allocate the weights of spatiotemporal features and environmental factors, highlighting the dominant role of spatiotemporal features. The attention weight calculation formulas are:

$\omega = \mathrm{Softmax}\left(W_a \cdot [F_{st}, F_{env}] + b_a\right)$      (20)

$F_{\text{final}} = \omega_1 \cdot F_{st} + \omega_2 \cdot F_{env}$      (21)

where, Fst is the spatiotemporal feature vector output by ST-GCN, Fenv is the global feature vector of environmental factors, Wa is the learnable weight matrix, ba is the bias term, and ω1 and ω2 are the attention weights of the spatiotemporal features and environmental factors, respectively.

The fused feature vector is input into a multi-layer fully connected network for yield regression. The network uses ReLU as the activation function, and the output layer uses a linear activation function to ensure the continuity and rationality of the yield prediction value. The output formula of the fully connected network is:

$\hat{y}=W_o \cdot \sigma\left(W_3 \cdot F_{\text {final }}+b_3\right)+b_o$           (22)

where, $\hat{y}$ is the final yield prediction value, W3 and Wo are the learnable weight matrices of the fully connected layers, and b3 and bo are bias terms. This multi-feature fusion design realizes reasonable allocation of features through attention weighting, enabling the model to comprehensively consider the spatiotemporal dynamics of crop health status and the collaborative influence of external environmental factors, significantly improving the accuracy and interpretability of yield prediction, and finally outputting accurate crop yield prediction values.
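Eqs. (19)–(22) together can be sketched as one small module. The hidden width and the per-vector standardization are illustrative choices; the paper does not fix either:

```python
import torch
import torch.nn as nn

class FusionRegressor(nn.Module):
    """Attention-weighted fusion of ST-GCN and environmental features
    (Eqs. 20-21) followed by the FC regression head (Eq. 22)."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.attn = nn.Linear(2 * dim, 2)  # W_a, b_a -> two fusion weights
        self.fc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))  # W3 then W_o (linear output)

    def forward(self, f_st: torch.Tensor, f_env: torch.Tensor) -> torch.Tensor:
        # Eq. (19): standardize each feature vector to remove dimensional differences
        f_st = (f_st - f_st.mean(-1, keepdim=True)) / (f_st.std(-1, keepdim=True) + 1e-6)
        f_env = (f_env - f_env.mean(-1, keepdim=True)) / (f_env.std(-1, keepdim=True) + 1e-6)
        # Eq. (20): softmax over the two modality weights omega_1, omega_2
        w = torch.softmax(self.attn(torch.cat([f_st, f_env], dim=-1)), dim=-1)
        f_final = w[..., :1] * f_st + w[..., 1:] * f_env   # Eq. (21)
        return self.fc(f_final).squeeze(-1)                # Eq. (22): yield y_hat
```

Because the two weights are softmax-normalized, the learned split between spatiotemporal and environmental contributions can be read off directly, which is the interpretability argument made above.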

2.5 Training strategy

The core innovation of the training strategy lies in deeply combining multi-task joint optimization with self-supervised contrastive pretraining. Aiming at the key problems of imbalance in multi-task collaborative training and insufficient feature learning in small-sample scenarios, a dedicated training process and loss functions are designed to achieve the coordinated improvement of model training efficiency, recognition accuracy, and generalization ability, providing guarantees for the stable performance of the end-to-end framework. This strategy closely matches the characteristics of multimodal image processing and agricultural scenarios. Through customized pretraining and fine-tuning mechanisms, it addresses the limitations of traditional training methods such as weak generalization ability and imbalance in multi-task training, ensuring that the model can still maintain excellent performance in complex agricultural scenarios and small-sample conditions.

The core innovation of multi-task joint optimization lies in designing a hybrid loss function with adaptive weight adjustment to balance the training priorities of health state recognition and yield prediction tasks, while addressing the class imbalance problem in health state segmentation. The loss function is composed of a weighted combination of pixel-level segmentation loss and yield regression loss. The segmentation loss adopts a weighted sum of cross-entropy loss and Dice loss, effectively alleviating the training bias caused by class imbalance, while the regression loss adopts mean squared error loss to ensure the accuracy of yield prediction. The core formulas are as follows:

$L_{\text{total}} = \alpha(t) \cdot L_{\text{seg}} + \beta(t) \cdot L_{\text{reg}}$        (23)

$L_{\text{seg}} = \gamma \cdot L_{ce} + (1 - \gamma) \cdot L_{dice}$          (24)

$L_{dice} = 1 - \frac{2\sum_{i=1}^{N} p_i g_i + \epsilon}{\sum_{i=1}^{N} p_i^2 + \sum_{i=1}^{N} g_i^2 + \epsilon}$       (25)

$L_{\text{reg}} = \frac{1}{M}\sum_{j=1}^{M}\left(y_j - \hat{y}_j\right)^2$          (26)

where, Ltotal is the total loss function, and α(t) and β(t) are adaptive weight coefficients satisfying α(t) + β(t) = 1, whose values are dynamically updated according to the convergence speed of the two tasks during training: when a task converges more slowly, its weight is automatically increased, achieving dynamic balancing of task priorities. Lseg is the segmentation loss, γ is the weight coefficient between the cross-entropy loss Lce and the Dice loss Ldice, and ϵ is a smoothing term that avoids division by zero. pi and gi are the predicted value and ground-truth label of the i-th pixel, respectively. Lreg is the regression loss, M is the number of samples, and yj and $\hat{y}_j$ are the ground-truth yield and predicted yield, respectively. This design solves the imbalance problem in multi-task training through adaptive weight adjustment and alleviates class imbalance through the hybrid segmentation loss, ensuring that the two tasks are jointly optimized and improve simultaneously.
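A sketch of the joint loss in Eqs. (23)–(26) for a binary (per-pixel foreground) case. The fixed `alpha` and `gamma` values stand in for the adaptive schedule α(t), β(t) = 1 − α(t), and the single-channel binary cross-entropy stands in for the multi-class term:

```python
import torch

def dice_loss(p, g, eps=1e-6):
    """Eq. 25: soft Dice loss over per-pixel probabilities p and labels g."""
    num = 2 * (p * g).sum() + eps
    den = (p ** 2).sum() + (g ** 2).sum() + eps
    return 1 - num / den

def total_loss(p, g, y_pred, y_true, alpha=0.6, gamma=0.5):
    """Eqs. 23-26 with fixed weights standing in for alpha(t), beta(t)."""
    ce = torch.nn.functional.binary_cross_entropy(p, g)
    l_seg = gamma * ce + (1 - gamma) * dice_loss(p, g)  # Eq. 24: hybrid seg loss
    l_reg = torch.mean((y_true - y_pred) ** 2)          # Eq. 26: MSE on yield
    return alpha * l_seg + (1 - alpha) * l_reg          # Eq. 23
```

An adaptive schedule would replace the fixed `alpha` with a value recomputed each epoch from the two tasks' recent loss-decrease rates.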

The core innovation of self-supervised contrastive pretraining lies in designing customized data augmentation strategies and contrastive learning schemes for multimodal image characteristics. Robust modality-invariant features are learned on unlabeled data, addressing the problem of insufficient multimodal feature learning in small-sample scenarios, and providing high-quality initialization parameters for the feature extraction network. The pretraining process is based on the Simple Framework for Contrastive Learning of Visual Representations (SimCLR) framework. The core idea is to construct positive and negative sample pairs through data augmentation, pulling closer the feature distance of different augmented views of the same image, and pushing away the feature distance of different images, thereby achieving effective learning of modality-invariant features.

Specifically, for RGB, multispectral, and thermal infrared modalities, differentiated data augmentation strategies are designed: RGB images adopt random cropping, horizontal flipping, and brightness adjustment; multispectral images adopt band perturbation and random cropping; thermal infrared images adopt brightness adjustment and random cropping, ensuring that the core features of each modality are retained after augmentation. The core formula of the contrastive loss is as follows:

$L_{\text{contrast}} = -\frac{1}{2N}\sum_{k=1}^{2N} \log \frac{\exp\left(\mathrm{sim}(z_k, z_k^{+})/\tau\right)}{\sum_{l=1}^{2N} I_{l \neq k} \exp\left(\mathrm{sim}(z_k, z_l)/\tau\right)}$          (27)

where, N is the number of samples in a training batch, zk is the feature embedding of the k-th sample, $\text{z}_{\text{k}}^{\text{+}}$ is the feature embedding of the positive sample of the k-th sample, τ is the temperature parameter, sim( ) is the cosine similarity function, and I is the indicator function. The pretraining process is only applied to the feature extraction networks of the three modalities. By minimizing the contrastive loss, robust features that are not affected by modality differences and augmentation perturbations are learned, providing good initialization for the subsequent fine-tuning of the end-to-end framework, and effectively improving the generalization ability of the model in small-sample scenarios.
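The contrastive loss of Eq. (27) can be sketched as follows, using the standard SimCLR (NT-Xent) formulation in which each of the 2N augmented views is anchored in turn:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, tau=0.1):
    """NT-Xent loss (Eq. 27). z1, z2: (N, D) embeddings of the two augmented
    views of a batch; row k of z1 and row k of z2 form a positive pair."""
    n = len(z1)
    z = F.normalize(torch.cat([z1, z2]), dim=1)  # 2N embeddings, unit norm
    sim = z @ z.t() / tau                        # cosine similarities / temperature
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim.masked_fill_(mask, float('-inf'))        # indicator I_{l != k}: drop l = k
    # positive of view k is view k+N (and vice versa)
    pos = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, pos)             # mean of -log softmax over 2N anchors
```

`cross_entropy` over the masked similarity rows computes exactly the per-anchor term of Eq. (27), averaged over all 2N views.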

The training process adopts a two-stage mode of “pretraining-fine-tuning” to further improve training efficiency and model performance. First, self-supervised contrastive pretraining is performed on unlabeled multimodal images for the feature extraction networks to obtain initialization parameters. Then, the pretrained parameters are transferred to the entire end-to-end framework, and fine-tuning is performed using the multi-task joint optimization loss function to simultaneously optimize the two tasks of health state recognition and yield prediction. This process not only solves the problem of insufficient feature learning in small-sample scenarios through pretraining, but also achieves collaborative adaptation of each module through fine-tuning, ensuring that the model accurately learns features while achieving efficient collaboration between the two tasks, ultimately improving recognition accuracy, prediction reliability, and generalization ability of the model.

3. Experiments and Results Analysis

3.1 Experimental setup

This experiment aims to comprehensively verify the effectiveness, superiority, and stability of the proposed end-to-end algorithm framework. Focusing on four major innovations, including multimodal feature extraction, cross-modal adaptive fusion, spatiotemporal yield prediction, and training strategy, ablation experiments, comparative experiments, generalization experiments, and stability experiments are designed. The experimental setup strictly follows the specifications of image processing journals to ensure reproducibility and rigor of the experiments. The dataset adopts a combination of public datasets and self-constructed datasets to ensure representativeness and complexity of the experiments. The public datasets include Agriculture-Vision and Unmanned Aerial Vehicle (UAV)-Multispectral Crop Dataset, covering two major crops, wheat and maize, and including three modalities: RGB, multispectral (8 bands), and thermal infrared images. The temporal range covers from the jointing stage to the maturity stage of crops, with a total of 12 time steps. The annotation information includes pixel-level labels of four types of health states (healthy, mild disease, severe disease, nutrient deficiency) and the corresponding actual yield data of fields. The self-constructed dataset is collected from the main wheat-producing areas in northern China and the main maize-producing areas in southern China, covering crop samples under different climate conditions and soil types. It supplements 1500 groups of multimodal images and corresponding annotations to alleviate the problem of single-scene limitation in public datasets. The two types of datasets are uniformly divided into training set, validation set, and test set according to the ratio of 7:2:1. The training set is used for model training and pretraining, the validation set is used for parameter tuning, and the test set is used for model performance evaluation.

The experimental environment configuration is as follows: the hardware adopts NVIDIA RTX 4090 GPU (24GB memory), Intel Core i9-13900K CPU, and 64GB DDR5 memory; the software framework is based on PyTorch 1.13.1, combined with OpenCV and Scikit-learn to complete data preprocessing and metric calculation. The training parameters are set as follows: the initial learning rate is 0.001, the AdamW optimizer is adopted (weight decay 0.0001), the batch size is set to 16, the number of iterations is 200, and an early stopping strategy (patience = 20) is adopted to prevent overfitting. The dropout probability is set to 0.3, and the temperature parameter τ = 0.1.
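Under the stated hyperparameters, the optimizer construction and early-stopping logic might look as follows (a sketch; the monitored validation quantity is assumed to be the total loss):

```python
import torch

def make_optimizer(model):
    """AdamW with the stated settings: lr 0.001, weight decay 0.0001."""
    return torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=20):
        self.patience, self.best, self.bad = patience, float('inf'), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0  # new best: reset counter
        else:
            self.bad += 1
        return self.bad >= self.patience       # True -> stop training
```

The stopper is queried once per epoch after validation; with patience = 20 and 200 epochs, training halts early only when no improvement appears for 20 consecutive epochs.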

3.2 Ablation experiments

The ablation experiments adopt the control variable method. The proposed innovative modules are removed or replaced one by one to construct five variant models, which are compared with the complete model in this paper to verify the effectiveness of each innovative module. The experimental results are shown in Table 1.

Table 1. Comparison of ablation experiment results

IoU, Precision, Recall, and F1-score measure health state recognition; MAE, RMSE, and R² measure yield prediction; inference time and parameter count measure computational efficiency.

| Model Variant | IoU (%) | Precision (%) | Recall (%) | F1-score (%) | MAE (kg/mu) | RMSE (kg/mu) | R² | Inference Time (ms) | Parameters (M) |
|---|---|---|---|---|---|---|---|---|---|
| Variant 1: Single backbone feature extraction | 72.35 | 75.12 | 73.89 | 74.49 | 18.76 | 25.32 | 0.821 | 42.6 | 38.7 |
| Variant 2: Without intra-modal self-attention | 78.62 | 80.35 | 79.41 | 79.87 | 15.42 | 21.58 | 0.867 | 39.8 | 45.2 |
| Variant 3: Without adaptive gating unit | 79.15 | 81.02 | 80.13 | 80.57 | 14.89 | 20.75 | 0.875 | 38.5 | 44.9 |
| Variant 4: Without spatio-temporal graph convolutional network modelling | 80.23 | 82.15 | 81.36 | 81.75 | 16.93 | 23.17 | 0.852 | 35.7 | 40.3 |
| Variant 5: Without self-supervised pretraining | 79.86 | 81.78 | 80.95 | 81.36 | 15.17 | 21.24 | 0.871 | 37.9 | 45.5 |
| Full model (proposed) | 83.57 | 85.26 | 84.68 | 84.97 | 12.35 | 17.89 | 0.913 | 38.2 | 45.7 |

From the experimental results in Table 1, it can be seen that each innovative module can significantly improve the model performance, verifying the rationality and effectiveness of the design in this paper. Variant 1 adopts a single backbone network to adapt to three modalities. Compared with the full model in this paper, Intersection over Union (IoU) decreases by 11.22%, F1-score decreases by 10.48%, Mean Absolute Error (MAE) increases by 6.41 kg/mu, and R² decreases by 0.092. This indicates that modality-specific feature extraction branches can accurately extract features according to different modality characteristics, effectively improving the specificity and discriminability of feature representation, and addressing the limitation that a single backbone network cannot adapt to multimodality.

Variant 2 removes intra-modal self-attention, resulting in a decrease of 5.10% in F1-score and a decrease of 0.046 in R². This indicates that intra-modal self-attention can enhance the semantic consistency of a single modality, capture long-range dependencies within feature maps, improve the discriminative ability of features, and reduce background noise interference. Variant 3 removes the adaptive gating unit, resulting in a decrease of 4.40% in F1-score and an increase of 2.86 kg/mu in Root Mean Square Error (RMSE). This indicates that the adaptive gating unit can dynamically adjust the contribution of each modality, suppress invalid noise, achieve efficient fusion of heterogeneous features, and avoid feature redundancy and information loss caused by fixed-weight fusion.

Variant 4 removes ST-GCN spatiotemporal modeling, resulting in a decrease of 0.061 in R² and an increase of 4.58 kg/mu in MAE, while the decline in health state recognition metrics is relatively moderate. This indicates that ST-GCN can effectively capture the spatiotemporal evolution patterns of crop health states, achieve accurate association modeling between health states and yield, and improve the accuracy and interpretability of yield prediction. Variant 5 removes self-supervised pretraining, resulting in a decrease of 3.61% in F1-score and a decrease of 0.042 in R², verifying that self-supervised pretraining can learn robust modality-invariant features on unlabeled data, provide high-quality initialization for the feature extraction network, and improve the generalization ability of the model.

3.3 Comparative experiments

To verify the superiority of the proposed method, advanced image processing methods related to multimodal fusion, crop health recognition, and yield prediction in the current field are selected as comparison models, including: 1) CNN-Fusion: a traditional convolution-based multimodal linear fusion method; 2) Vision Transformer Fusion (ViT-Fusion): a Transformer-based multimodal fusion method; 3) Spatiotemporal Convolutional Long Short-Term Memory Network (ST-ConvLSTM): a spatiotemporal convolutional long short-term memory-based prediction method; 4) Multi-modal Feature Fusion Graph Convolutional Network (MFF-GCN): a prediction method based on multimodal feature fusion and graph convolution; 5) Multi-Scale Attention Fusion Network (MSAF-Net): a crop health recognition method based on multi-scale attention fusion. The comparative experiments are conducted under the same dataset and experimental environment, and the core metric comparison results are shown in Table 2.

Table 2. Comparison of comparative experiment results

IoU, Precision, Recall, and F1-score measure health state recognition; MAE, RMSE, and R² measure yield prediction; inference time and parameter count measure computational efficiency.

| Method | IoU (%) | Precision (%) | Recall (%) | F1-score (%) | MAE (kg/mu) | RMSE (kg/mu) | R² | Inference Time (ms) | Parameters (M) |
|---|---|---|---|---|---|---|---|---|---|
| Convolutional Neural Network-Fusion | 74.18 | 76.35 | 75.22 | 75.77 | 20.15 | 27.46 | 0.803 | 36.8 | 32.4 |
| Vision Transformer-Fusion | 78.95 | 80.72 | 79.86 | 80.28 | 17.32 | 23.89 | 0.847 | 52.3 | 68.9 |
| Spatiotemporal Convolutional Long Short-Term Memory Network | 77.53 | 79.21 | 78.45 | 78.82 | 16.87 | 23.05 | 0.854 | 48.6 | 56.7 |
| Multi-modal Feature Fusion Graph Convolutional Network | 80.12 | 82.03 | 81.25 | 81.63 | 15.06 | 20.98 | 0.876 | 43.5 | 51.2 |
| Multi-Scale Attention Fusion Network | 81.36 | 83.15 | 82.47 | 82.80 | 18.24 | 24.53 | 0.839 | 40.1 | 48.5 |
| Proposed Method | 83.57 | 85.26 | 84.68 | 84.97 | 12.35 | 17.89 | 0.913 | 38.2 | 45.7 |

From the experimental results in Table 2, it can be seen that the proposed method is significantly superior to the comparison methods on all core metrics, fully demonstrating its superiority. In the health state recognition task, the IoU and F1-score of the proposed method reach 83.57% and 84.97%, respectively. Compared with the best comparison method MSAF-Net, they are improved by 2.21% and 2.17%, respectively; compared with the traditional method CNN-Fusion, they are improved by 9.39% and 9.20%, respectively. This indicates that the proposed cross-modal adaptive fusion mechanism can effectively solve the problems of heterogeneous feature alignment and noise suppression, improving segmentation accuracy and feature discriminability.

In the yield prediction task, the MAE and RMSE of the proposed method are reduced to 12.35 kg/mu and 17.89 kg/mu, respectively, and R² reaches 0.913. Compared with the best comparison method MFF-GCN, MAE is reduced by 2.71 kg/mu, RMSE is reduced by 3.09 kg/mu, and R² is improved by 0.037. This indicates that the design of ST-GCN spatiotemporal modeling and multi-feature fusion prediction can accurately capture the spatiotemporal dynamics of crop health states and the synergistic effects of environmental factors, improving the accuracy and reliability of yield prediction.

In terms of computational efficiency, the inference time of the proposed method is 38.2 ms and the number of parameters is 45.7M, which is better than ViT-Fusion, ST-ConvLSTM, MFF-GCN, and MSAF-Net, and only slightly higher than CNN-Fusion, achieving a coordinated improvement of accuracy and efficiency, and having practical deployment value. Although ViT-Fusion can achieve a certain fusion effect, it has a large number of parameters and slow inference speed, making it difficult to adapt to the real-time monitoring requirements of agricultural scenarios. MFF-GCN lacks effective multi-scale fusion and adaptive weight adjustment mechanisms, and cannot fully exploit the complementary value of heterogeneous modalities. MSAF-Net only focuses on health state recognition and does not achieve integrated modeling with yield prediction, resulting in limited prediction performance.

3.4 Generalization experiments and stability analysis

To verify the cross-scenario adaptability of the proposed method, generalization experiments are designed. Three different test scenarios are selected: different crop (rice), different region (wheat in the northwest arid region), and different temporal sequence (crop growth period advanced by 2 time steps). The performance of the proposed method under each scenario is tested, and the experimental results are shown in Table 3.

Table 3. Generalization experiment results

IoU, Precision, Recall, and F1-score measure health state recognition; MAE, RMSE, and R² measure yield prediction.

| Test Scenario | IoU (%) | Precision (%) | Recall (%) | F1-score (%) | MAE (kg/mu) | RMSE (kg/mu) | R² |
|---|---|---|---|---|---|---|---|
| Original scenario (wheat, maize) | 83.57 | 85.26 | 84.68 | 84.97 | 12.35 | 17.89 | 0.913 |
| Different crop (rice) | 80.12 | 82.05 | 81.37 | 81.71 | 13.89 | 19.67 | 0.889 |
| Different region (wheat in northwest arid region) | 79.45 | 81.32 | 80.68 | 81.00 | 14.56 | 20.34 | 0.878 |
| Different temporal sequence (advanced by 2 time steps) | 78.63 | 80.57 | 79.89 | 80.23 | 15.12 | 21.08 | 0.867 |

From the experimental results in Table 3, it can be seen that the proposed method maintains good performance across different crops, regions, and temporal scenarios, showing strong generalization ability. Compared with the original scenario, IoU decreases by 3.45% and R² by 0.024 in the different-crop scenario; IoU decreases by 4.12% and R² by 0.035 in the different-region scenario; and IoU decreases by 4.94% and R² by 0.046 in the different-temporal scenario. The performance degradation is within 5%, indicating that the proposed method effectively adapts to phenotypic differences between crops, environmental differences between regions, and growth-state differences between temporal sequences, addressing the weak generalization of traditional methods. This robustness is mainly attributed to the modality-invariant features learned by the self-supervised pretraining strategy, together with the adaptability of the modality-specific feature extraction and adaptive fusion mechanisms, which effectively mine the core features of crop health status in different scenarios and achieve accurate recognition and prediction.

To verify the stability of training, the complete model is retrained 10 times, and the mean and standard deviation of each core metric are calculated. The results are shown in Figure 5.

The results in Figure 5 show that, over 10 repeated trainings, the standard deviations of all core metrics are small: 0.21% for IoU, 0.18% for F1-score, 0.23 kg/mu for MAE, 0.27 kg/mu for RMSE, and 0.005 for R². The training process is therefore stable, without obvious fluctuations, and model performance is highly consistent across runs. This stability is attributable to the multi-task joint optimization strategy and adaptive weight adjustment mechanism designed in this paper, which effectively avoid overfitting and unstable convergence during training, ensuring stable and excellent performance across repeated trainings.

Figure 5. Stability experiment results (10 repeated trainings)
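The mean-and-standard-deviation summary behind Figure 5 is straightforward to reproduce; a minimal sketch using Python's standard library, with hypothetical per-run IoU values standing in for the paper's unreported raw numbers:

```python
import statistics

# Summarize one metric over repeated trainings: mean and sample
# standard deviation, as reported per metric in Figure 5.
def summarize(runs):
    return statistics.mean(runs), statistics.stdev(runs)

# Hypothetical per-run IoU values (illustrative only, not the
# paper's raw data; the paper reports mean 83.57, std 0.21).
iou_runs = [83.41, 83.62, 83.55, 83.71, 83.38,
            83.60, 83.49, 83.66, 83.52, 83.76]
mean_iou, std_iou = summarize(iou_runs)
print(f"IoU: {mean_iou:.2f} +/- {std_iou:.2f}")
```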

To verify the advantages of the self-supervised pretraining strategy in small-sample scenarios, small-sample experiments are designed in which the training set is reduced to 10%, 20%, 30%, 50%, and 70% of the original sample size, with the full (100%) setting retained as a reference. The performance of the proposed method (with self-supervised pretraining) is compared against the same method without self-supervised pretraining. The results are shown in Figure 6 and Table 4.

The results in Figure 6 and Table 4 show that, in small-sample scenarios, the proposed method (with self-supervised pretraining) performs significantly better than the variant without pretraining, and the advantage grows as the sample size shrinks. With only 10% of the original training samples, the proposed method improves F1-score by 7.47%, improves R² by 0.064, and reduces RMSE by 5.57 kg/mu relative to the variant without pretraining. Self-supervised pretraining thus learns robust modality-invariant features from unlabeled data, alleviating insufficient feature learning in small-sample scenarios and providing high-quality initialization that improves the model's generalization ability and recognition accuracy. As the training sample size increases, the performance gap between the two methods gradually narrows, but the proposed method still maintains an advantage. These results verify the effectiveness and necessity of the self-supervised pretraining strategy, which enables the method to adapt to agricultural monitoring scenarios with limited samples and further increases its practical value.

Figure 6. Comparison of small-sample method performance results

Table 4. Comparison of small-sample yield prediction results (RMSE, kg/mu)

| Training Sample Ratio | Without Pretraining | Proposed Method |
|---|---|---|
| 10% | 28.75 | 23.18 |
| 20% | 25.42 | 20.57 |
| 30% | 23.15 | 19.23 |
| 50% | 20.78 | 18.56 |
| 70% | 19.32 | 18.01 |
| 100% | 17.53 | 17.89 |
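The pretraining benefit at each sample ratio can be read off Table 4 as the RMSE gap between the two variants; a minimal sketch (values transcribed from the table):

```python
# RMSE values (kg/mu) transcribed from Table 4, one entry per
# training-sample ratio.
ratios   = ["10%", "20%", "30%", "50%", "70%", "100%"]
no_pre   = [28.75, 25.42, 23.15, 20.78, 19.32, 17.53]
proposed = [23.18, 20.57, 19.23, 18.56, 18.01, 17.89]

# Positive gap = RMSE reduction achieved by self-supervised pretraining.
gaps = [round(a - b, 2) for a, b in zip(no_pre, proposed)]
for ratio, gap in zip(ratios, gaps):
    print(f"{ratio}: RMSE reduction {gap} kg/mu")
```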

Figure 7. Cross-modal adaptive fusion feature response heatmap

Figure 8. Pixel-level crop health state segmentation results

To visually verify the effectiveness of the proposed multimodal adaptive fusion mechanism and health state recognition module, visualization experiments on multimodal feature responses and pixel-level health state segmentation are designed, and the feature enhancement and segmentation performance of the algorithm in complex farmland scenarios is analyzed both quantitatively and qualitatively. The cross-modal adaptive fusion feature response heatmap in Figure 7 uses a cold-warm color mapping normalized to [0, 1], with highlighted regions corresponding to high-response features related to crop health. The feature response in healthy regions is uniform and stable; mild disease regions show gradient enhancement; severe disease regions show dense high-intensity centers; and nutrient deficiency regions show continuous, smooth activation. Background noise is effectively suppressed in complex mixed-cropping and low-light scenarios, and feature activation in crop regions is uniform and undistorted, verifying the collaborative effect of intra-modal self-attention, cross-attention, and adaptive gating units in heterogeneous feature enhancement, noise suppression, and semantic consistency improvement.

The pixel-level health state segmentation map in Figure 8 uses standardized four-color annotation. The segmentation results have smooth edges and complete closed regions, without fragments or misclassification, and disease and nutrient deficiency regions are accurately located. In complex mixed-cropping scenarios, the health state partitions of different crops are clearly separated; in low-light and weak-texture regions, complete segmentation is still maintained. These results, consistent with the feature heatmap, reflect the high-resolution parsing capability of the multi-scale fused features and the lightweight decoder.
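The [0, 1] normalization used for the Figure 7 heatmap is a standard min-max rescaling; a minimal sketch of that convention (the function name and sample map are illustrative, not from the paper):

```python
# Min-max normalize a 2D feature-response map to [0, 1], the
# convention stated for the Figure 7 heatmap.
def normalize01(response):
    lo = min(min(row) for row in response)
    hi = max(max(row) for row in response)
    if hi == lo:  # flat map: avoid division by zero
        return [[0.0] * len(row) for row in response]
    return [[(v - lo) / (hi - lo) for v in row] for row in response]

heat = normalize01([[1.0, 3.0], [2.0, 5.0]])
print(heat)
```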

This visualization experiment demonstrates that the proposed multimodal adaptive fusion framework effectively mines the complementary information of the RGB, multispectral, and thermal infrared modalities and achieves precise alignment and dynamic weighting of heterogeneous features. It thereby provides reliable feature support for high-resolution recognition of crop health states and for subsequent yield prediction, highlighting the robustness and superiority of the method in complex agricultural scenarios.
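The dynamic weighting described above can be illustrated with a toy version of an adaptive gating unit: gate scores are normalized into per-modality weights, and the fused feature is the weighted sum. This is a hedged sketch of the idea only; the gate scores and feature vectors below are illustrative, whereas the paper's gates are learned from the features themselves:

```python
import math

def softmax(scores):
    """Normalize raw gate scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(features, gate_scores):
    """Weighted sum of per-modality feature vectors.

    features:    dict modality -> feature vector (equal lengths)
    gate_scores: dict modality -> scalar gate score (illustrative here)
    """
    mods = list(features)
    weights = softmax([gate_scores[m] for m in mods])
    dim = len(next(iter(features.values())))
    fused = [sum(w * features[m][i] for w, m in zip(weights, mods))
             for i in range(dim)]
    return fused, dict(zip(mods, weights))

# Toy example: three modalities with hand-picked scores.
feats = {"rgb": [0.8, 0.1], "multispectral": [0.6, 0.4], "thermal": [0.2, 0.9]}
fused, weights = fuse(feats, {"rgb": 1.2, "multispectral": 0.8, "thermal": 0.3})
print(fused, weights)
```

With equal gate scores the fusion reduces to a plain average, which is one way to see that the gating only redistributes emphasis rather than creating new features.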

4. Conclusion

This paper addressed key problems in crop health recognition and yield prediction from multimodal remote sensing images: difficulty in nonlinear alignment of heterogeneous features, insufficient dynamic adaptation of modality contributions, inadequate spatiotemporal association modeling, and limited generalization in small-sample scenarios. Focusing on innovation in image processing technology, an end-to-end framework based on multimodal adaptive fusion and ST-GCN was proposed, systematically integrating multimodal feature extraction, cross-modal fusion, health state recognition, and yield prediction. The core innovations are: modality-specific feature extraction branches that capture the characteristics of each modality; a multi-scale cross-modal adaptive fusion mechanism that combines intra-modal self-attention, inter-modal cross-attention, and adaptive gating units to bridge the semantic gap between heterogeneous modalities and suppress noise; an ST-GCN-driven spatiotemporal modeling method that accurately associates the spatiotemporal dynamics of crop health states with yield; and a multi-task joint optimization and self-supervised contrastive pretraining strategy that balances task priorities and improves generalization in small-sample scenarios. Experimental results fully verify that the proposed method significantly outperforms existing advanced methods in health state recognition accuracy, yield prediction reliability, and computational efficiency, and that each innovative module improves model performance while demonstrating good stability and generalization ability.

The proposed method forms an innovative paradigm for the deep integration of image processing technology and agricultural applications. It enriches research on multimodal feature fusion and spatiotemporal modeling, providing a new technical pathway for heterogeneous multimodal image processing with clear academic value, and it achieves high-resolution recognition of crop health states and reliable yield prediction, offering efficient technical support for precision agricultural monitoring and grain yield estimation with strong application prospects. Future work will address the method's limitations by drawing on frontier image processing techniques: further improving multimodal data alignment accuracy, designing lightweight spatiotemporal modeling architectures to raise computational efficiency, integrating contrastive learning and few-shot learning to strengthen recognition of extreme diseases in small-sample scenarios, and extending the framework to additional modalities such as hyperspectral data, thereby continuously improving algorithm performance and promoting the in-depth application of image processing technology in precision agriculture.

Acknowledgment

This work was supported by the Program for Young Talents of Basic Research in Universities of Heilongjiang Province (Grant No. YQJH2024090) and by the 2024 Basic Scientific Research Operating Expenses Project for Provincial Undergraduate Universities in Heilongjiang Province.
