The Deep Learning Methods for Fusion Infrared and Visible Images: A Survey

Deep learning (DL) has substantially improved both the efficiency and the quality of infrared and visible image fusion, and the resulting fused images have a broad range of potential applications. Nevertheless, further investigation and innovation are required to address the open difficulties and debates surrounding the use of DL in image fusion. Various imaging techniques are available to capture information in the infrared and visible segments of the electromagnetic spectrum, and each modality carries distinct advantages and disadvantages that depend on the application and on how infrared and visible images represent objects. Visible images convey fine details such as color, texture, shape, and contrast, and they conform to the perceptual traits of human observation. Infrared images render these details at a different level but provide thermal information. Fusing the two with deep learning combines the textural detail of visible images with the thermal data of infrared images. This article explores deep learning techniques for fusing infrared and visible images. It reviews fusion approaches based on CNNs, GANs, auto-encoders, and transformers, and it describes the subjective and objective methods used to evaluate fused images. The survey provides a comprehensive overview of current research and suggests future directions for deep learning-based fusion methods.


INTRODUCTION
Image fusion combines images captured by different types of sensors to create a reliable and informative image that aids further processing or decision-making. A successful fusion approach must extract image information efficiently and apply fusion principles appropriately. Feature extraction techniques obtain useful information from the input images, which is then combined into the fused image without introducing artifacts. This process may be carried out in several different ways. Various image pairs can be fused, including visible and SAR images, infrared and SAR images, infrared and visible images, and medical images such as CT and MRI scans [1,2]. Figure 1 summarizes the general image fusion procedure.
In addition to remote sensing and medical imaging, image fusion technology also finds extensive applications in security, surveillance, human visual assistance systems, and the military. Therefore, its study holds significant importance in these applications [3].
Due to its wide range of applications, image fusion has gained significant attention in the research community in recent decades. As the field has advanced, many image transforms and spatial filters have been developed for both general and specific types of images. A fundamental goal of image fusion is to produce visually appealing results while also scoring well on objective metrics [4]. Fusing infrared and visible images greatly enhances the ability to describe details and to represent hot targets accurately. An infrared image is formed from the infrared radiation detected by the sensor. Targets usually stand out clearly in it, so interference such as darkness and smoke has little effect. However, the infrared sensor cannot capture the finer details of a scene because it is insensitive to brightness variations. Visible images contain a great deal of texture information at high spatial resolution, which suits human vision, but visible sensors are affected by smoke, darkness, and similar conditions, which can make the scene hard to see. Given these complementary properties, combining infrared and visible images is extremely beneficial for retaining the target region [5].
Researchers have explored various methods for integrating thermal and visible images. These fall into two types: traditional methods and deep learning techniques. Traditional methods commonly follow a three-step process when merging pixel-level or predefined features. The first step, feature extraction, extracts features from the source images. Next, in feature fusion, combination schemes merge the extracted features. Finally, the fused image is reconstructed by applying the corresponding inverse transformation [6,7]. According to the mathematical transformation used, traditional infrared and visible image fusion techniques can be further divided into six groups: multi-scale transform-based (MST) [8,9], low-rank representation-based (LRR) [10,11], sparse representation-based [12,13], saliency-based [14,15], subspace-based [16], and hybrid methods [17]. Figure 2 shows some examples of traditional methods. MST methods use transform operators to decompose an image into sub-layers, design fusion strategies to combine the sub-layers, and apply the inverse transformation to produce the final image. For example, one method used the Laplacian pyramid as the transform operator and produced a weight map, based on local entropy, contrast, and brightness, to combine the corresponding layers; as a result, good results can be achieved even in low-light conditions [22]. However, MST methods are heavily influenced by the transformation used, and poorly chosen fusion rules can introduce artifacts into the results [23].
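The decompose, fuse, and invert pipeline of traditional methods can be illustrated with a deliberately simplified two-scale decomposition. This is a minimal sketch, not the Laplacian-pyramid method of [22]: the box filter, the averaging rule for the base layers, and the max-absolute rule for the detail layers are all assumptions chosen for brevity, and the function names `box_blur` and `two_scale_fuse` are hypothetical.

```python
import numpy as np

def box_blur(img, radius=2):
    """Mean filter with edge padding (a stand-in for a real multi-scale transform)."""
    img = np.asarray(img, dtype=np.float64)
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img)
    # Sum the k*k shifted copies of the padded image, then normalize.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def two_scale_fuse(ir, vis, radius=2):
    """Fuse two registered grayscale images of equal shape."""
    ir = np.asarray(ir, dtype=np.float64)
    vis = np.asarray(vis, dtype=np.float64)
    # Step 1 (decomposition): base = low frequencies, detail = residual.
    base_ir, base_vis = box_blur(ir, radius), box_blur(vis, radius)
    detail_ir, detail_vis = ir - base_ir, vis - base_vis
    # Step 2 (fusion rules, assumed): average the bases, keep the stronger detail.
    fused_base = 0.5 * (base_ir + base_vis)
    fused_detail = np.where(np.abs(detail_ir) >= np.abs(detail_vis),
                            detail_ir, detail_vis)
    # Step 3 (inverse transform): for this decomposition it is a simple sum.
    return fused_base + fused_detail
```

Because the fusion rules are fixed by hand, the result depends entirely on how well they suit the input pair, which is the sensitivity to transform and rule choice noted above.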
Sparse representation (SR) is an alternative to multi-scale transforms (MST). The goal of SR is to build an over-complete dictionary with which the input images can be represented sparsely. The fused image is then reconstructed from the merged sparse representation coefficients [6].
For example, visible and infrared images have been represented using a fixed over-complete dictionary built from the discrete cosine transform. In target-oriented fusion, saliency methods can enhance the visual quality of fused images because they help preserve the integrity of the salient target region in the fused image [24]. To increase the amount of visual information in the fusion output, one method used a visual saliency map together with rolling guidance and Gaussian filtering to separate the images into layers, which were then fused [25]. Low-rank representation (LRR) efficiently decomposes images into low-rank and sparse components. The low-rank components represent the image's global structure, while the sparse components represent its local details. The sub-layers are then fused according to appropriate rules to produce the final fused image. Although traditional image fusion methods have produced satisfactory results, they still have three limitations. First, the final result depends on the quality of the handcrafted features used in the fusion process. Second, conventional approaches such as sparse representation (SR) can be computationally expensive. Third, fixed fusion methods need to be tailored to different image datasets [6]. More broadly, conventional methods are limited by their handcrafted design, limited generalization capacity, and computational complexity. They depend on the properties of the input images and the desired output attributes, and they may fail to capture crucial information or handle dynamic scenes. They also struggle with high-resolution or multimodal images and may be unable to take advantage of the growing amount and diversity of image data or to generate fused images with fewer imperfections at lower computational cost.
Deep learning (DL) is widely used for image fusion because of its adaptability, robustness, and ability to reduce noise. Conventional image fusion techniques typically use a sequential procedure consisting of feature extraction, feature combination, and inverse transformation.
However, this approach can lead to unwanted artifacts in the fused image, and it can be both intricate and time-consuming to design. Deep learning techniques, by contrast, adapt the weights of the fusion model during training. This enables the model to learn the varied characteristics of the source images and to generate a fused image with fewer imperfections. Deep learning approaches can also incur much lower computational cost than traditional fusion rules, a critical factor in many fusion scenarios. In addition, deep learning algorithms extract features from images automatically, which makes them highly suitable for tasks such as fusion, and they can handle high-resolution or multi-modal images. Hence, deep learning algorithms have surpassed traditional methods in many respects and are widely employed in image fusion.

INFRARED AND VISIBLE IMAGE FUSION METHODS BASED ON DEEP LEARNING
Deep learning-based image fusion uses deep neural networks to fuse multiple images with different characteristics into a single, high-quality image. Deep learning techniques for image fusion have recently grown in popularity because they extract features from images automatically, which makes them well suited to tasks like fusion. Deep learning is already used for feature extraction in several applications, including image classification, image processing, and object recognition, and it helps overcome the drawbacks of traditional fusion approaches. Several types of neural networks are used in deep learning image fusion: convolutional neural networks, generative adversarial networks, auto-encoders, and transformers. Despite the high quality of the fusion results produced by deep learning techniques, some areas still need improvement. To provide a complete picture, we discuss each type of method separately below.

Convolutional neural network-based fusion approaches
Fusing infrared and visible images with a deep learning architecture is a straightforward and efficient technique [26]. In one approach, the source images are split into low-frequency (base) content and texture (detail) content, and deep features are extracted from the detail content with a multilayer fusion strategy built on the VGG-19 network [27]. The choice of loss function significantly affects a CNN's capacity to learn. A CNN-based method was also proposed for transferring the style of one image to another [28]. It extracts deep features from the generated image, the style image, and the content image at different layers of the VGG-19 network [27], and then minimizes the difference between the deep features of the generated image and those of the original images.
The pre-trained ResNet50 network was employed in [29] to extract deep features from the source images. This network comprises 50 weight layers organized into five convolutional blocks. Zero-phase component analysis (ZCA) was used to normalize the deep features and obtain the initial weight map. Finally, a soft-max operation produced the final weights for the source images, and the fused image was reconstructed with a weighted-averaging strategy. In another approach, a multi-channel convolutional network used three channels to extract features: visible features, infrared features, and features common to both infrared and visible images. The decoding module produced the fused image by adding and averaging the feature maps, and a variety of loss functions were used to cope with the lack of labeled data. By redesigning the loss function, the visible and thermal infrared images were combined adaptively and noise interference was reduced. The technique is computationally efficient and can preserve important texture details and features without obvious artifacts [30,31].
CNN-based fusion methods are highly effective at merging infrared and visible images: they extract deep features, preserve information, and improve contrast and visibility. They are versatile and can be applied in a variety of settings, including medical imaging, night vision, and remote sensing. Nevertheless, these models require substantial amounts of training data and computational resources, they may suffer from overfitting and from limited generalization or transferability, and they can introduce artifacts or distortions. Possible enhancements include more sophisticated network structures, the integration of prior knowledge, and the design of robust evaluation criteria. Artifacts can be mitigated through pre-processing techniques, skip connections, residual blocks, or suitable loss functions. Potential areas for future research include fusing more than two modalities, handling dynamic or temporal image sequences, and applying fusion techniques in domains such as biomedical imaging, security, surveillance, and cultural heritage.
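The last two steps of the weight-map strategy described for [29], a soft-max over per-source weight maps followed by weighted averaging, can be sketched as follows. This is only an illustration under stated assumptions: the `activity_maps` argument stands in for the initial weight maps (which in the cited method come from ZCA-normalized ResNet50 features and are not reproduced here), and the function name `softmax_weighted_fusion` is hypothetical.

```python
import numpy as np

def softmax_weighted_fusion(sources, activity_maps):
    """Fuse K registered images using a per-pixel soft-max over K activity maps.

    sources, activity_maps: sequences of K 2-D arrays, all of the same shape.
    Returns the fused image and the (K, H, W) weight stack.
    """
    imgs = np.stack([np.asarray(s, dtype=np.float64) for s in sources])
    acts = np.stack([np.asarray(a, dtype=np.float64) for a in activity_maps])
    # Subtract the per-pixel maximum for numerical stability of exp().
    acts = acts - acts.max(axis=0, keepdims=True)
    weights = np.exp(acts)
    weights /= weights.sum(axis=0, keepdims=True)  # weights sum to 1 per pixel
    fused = (weights * imgs).sum(axis=0)           # weighted average of sources
    return fused, weights
```

Where the two activity maps are equal, the weights are 0.5 each and the result reduces to plain averaging; where one source is much more active, the fused pixel follows that source.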

Generative adversarial network-based fusion approaches
CNN-based image fusion models typically require ground truth, yet defining a standard fused image for visible and infrared image pairs is not practical. One way around this is to build a deep model that does not use ground truth but instead assesses the blurriness of each region of the source images and determines the fusion weights accordingly. Alternatively, fusing infrared and visible images with a generative adversarial network (GAN) avoids the aforementioned problems [32]. A target edge-enhancement loss function was later used to optimize target textures so that targets appear more clearly in the fusion output [33]. Because FusionGAN may lose pixel information from the infrared images, the same authors also created a detail loss function to retain more semantic information. A GAN with multi-classification constraints was proposed to balance the information between the infrared and visible images [34]. However, these methods emphasize visual quality while ignoring how well the fusion results support subsequent high-level vision tasks. Another approach uses a pre-fused image as the generator's label [35]. The generator is thus trained to produce images that are as similar as possible to the pre-fused image, which ensures that the fused image retains both the thermal information of the infrared image and the rich texture of the visible image. The authors argue that the advantages of this approach outweigh its disadvantages, even though it is computationally expensive because a pre-fused image is needed for each training cycle. Figure 3 shows a GAN-based image fusion technique [36].
GAN-based fusion techniques have shown potential by producing high-quality fused images that incorporate significant characteristics of both infrared and visible images. Nevertheless, they can incur substantial computing costs and may not serve complex high-level vision tasks well. GAN-based fusion algorithms address potential artifacts through adversarial training, but their efficacy can depend on the quality of the pre-fused image used during training. Future investigations may focus on more effective training techniques, novel loss functions, and adapting the fusion process to diverse datasets.

Auto-encoder-based fusion approaches
An unsupervised auto-encoder network was presented in [37]; its encoder extracts features from the source images using CNN layers and dense blocks. After a suitable fusion strategy is applied, the fused features are decoded by the decoding module. Incorporating the dense block into the encoding module preserves as much information as possible. Figure 4 illustrates the auto-encoder fusion approach. In NestFuse, a nest connection architecture is used as the decoding network, while the encoder is converted into a multi-scale network [38]. Spatial and channel attention fusion strategies are used to fuse the salient parts of the image with the background information, but multi-modal features cannot be fully exploited with this handcrafted approach. The RGB-thermal fusion network (RTFNet) [39] is a three-part system: an RGB encoder and an infrared encoder, which extract features from the RGB and thermal images, and a decoder, which restores the resolution of the feature maps. When RTFNet is used for feature extraction, the accuracy of the estimated feature map may be recovered with a new encoder if the encoder and decoder are symmetrical in structure. Because the method is designed primarily for scene segmentation, the edges of its outputs are not sharp. Possible directions for improving auto-encoder-based fusion include:
• Enhancing the use of multi-modal features and the sharpness of edges;
• Investigating techniques for handling artifacts;
• Optimizing the symmetry of the encoder and decoder to improve the estimation of feature maps.

Transformer-based fusion approaches
The transformer achieved significant success in its initial application to natural language processing [40]. Whereas a CNN focuses on local features, the transformer's attention mechanism can model long-range dependencies, which allows it to make better use of global information in both shallow and deep layers. The vision transformer [41] showed that transformers have considerable potential for computer vision (CV). CV researchers have recently applied transformers to a growing number of tasks, such as object detection, multiple object tracking, and segmentation. Transformers are based mainly on attention mechanisms [42]. A unified transformer-based model was proposed that performed well on several low-level vision tasks [43]. The global spatial dependency modeling of transformers has been applied to several areas of computer vision. Motivated by these properties, TGFuse focuses on the overall correlation of image space and channels throughout the fusion process and uses a lightweight transformer module together with adversarial learning for infrared and visible image fusion [44]. In its transformer fusion module, the shallow features extracted by a CNN interact with each other, so that the transformer builds efficient global fusion relationships. This interaction refines the fusion relationship both across channels and within the spatial range. Adversarial learning, which enforces competitive consistency from the inputs during training, improves the discriminability of the output. An improved fusion model based on focal transformers and a multi-modal feature self-adaptive fusion strategy was proposed to fuse infrared and visible information into an image that is both visually appealing and more informative [6]. The image fusion transformer of [45] uses a spatio-transformer (ST) fusion strategy to fuse images obtained from different sensors. It has three parts: an encoder network, an ST fusion network, and a nested decoder network. The ST fusion network, which consists of a spatial branch and a transformer branch, fuses features at multiple scales.
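The long-range interaction that the attention mechanism provides can be illustrated with single-head scaled dot-product self-attention. This is a generic sketch, not the fusion module of TGFuse or of the image fusion transformer: the `tokens` are assumed to be feature vectors taken at flattened spatial positions of a CNN feature map, and the projection matrices `w_q`, `w_k`, and `w_v` would normally be learned rather than supplied by hand.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (T, D) array, one D-dimensional feature per spatial position.
    w_q, w_k, w_v: (D, D_h) projection matrices (assumed given).
    Returns the attended features (T, D_h) and the (T, T) attention matrix.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Every position is compared with every other position: global context.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the soft-max
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # each row sums to 1
    return attn @ v, attn
```

Each output row is a weighted mixture of all positions, whereas a convolution mixes only a local neighborhood. The (T, T) attention matrix is also why the cost grows quadratically with the number of positions.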
Transformer-based fusion methods have shown potential in image fusion. They exploit global information efficiently and strengthen the fusion relationships across channels and within the spatial range. Nevertheless, they can be hampered by the complexity of the transformer architecture and the computational expense of adversarial learning.
These methods do not directly address artifact handling, so further research and development in this area would be advantageous.
The efficiency of global fusion interactions and of adversarial learning can potentially be improved. Possible future research directions include investigating artifact-handling approaches and optimizing the symmetry of the encoder and decoder to improve feature map estimation.
Furthermore, future studies might apply these methods to more diverse datasets and incorporate more sophisticated attention mechanisms.

ASSESSMENT OF FUSED IMAGE
The best algorithm or approach is often chosen by comparing different image processing methods under suitable evaluation measures. Image fusion is a common choice for a variety of image-enhancement tasks, including adjusting image resolutions for alignment, overlaying two image products, and combining images for feature extraction and target recognition. Because image fusion is used in many geospatial and night-vision applications, it is crucial to understand how fusion algorithms are evaluated objectively [46]. Researchers can use different task-specific assessment indicators to make quantitative and precise comparisons between fusion methods. The available evaluation indicators can be divided into subjective and objective evaluation [47].

Subjective evaluation approaches
Subjective assessment is carried out in absolute and relative terms using the well-known five-level quality scale and impairment scale, respectively [48]. An effective subjective assessment involves visually inspecting the image without any aids and carefully analyzing its details, distortion, contrast, and integrity in order to compare different fusion methods. Assessors use the assessment criteria to assign a quality grade to the fused image. However, different people apply different standards to the same image, and these standards are easily influenced by context, environment, and other variables, which leads to inaccurate assessments of the fused image. Because it is unreliable and slow, this approach makes it difficult to assess fused images along several dimensions. Objective assessment metrics must therefore be used as well to assess fusion outcomes accurately [49].
Subjective evaluations of fused images are thus constrained by the divergent criteria of different assessors. The interpretation of the same image can vary among individuals because of their subjective perceptions, environmental influences, and personal biases, which leads to inconsistent and incorrect assessments.

Objective evaluation approaches
Objective evaluation measures were created to overcome the constraints of subjective evaluation. These metrics use precise formulas to compute index values for the fused image, which gives image fusion a more standardized and consistent evaluation method. The metrics fall into four groups, based respectively on information theory, structural similarity, feature similarity, and the source and produced images. Using reference and no-reference standards, they provide a more systematic, quantitative, and impartial way of assessing the quality of fused images and the efficacy of fusion techniques [49,50].

Metrics based on information theory
(1) Entropy (EN) EN is defined as $EN = -\sum_{i=0}^{L-1} p_i \log_2 p_i$, where L is the number of gray levels in the image (256 for gray levels 0 to 255) and $p_i$ is the probability of gray level i in the image. EN indicates the texture richness and the average amount of information in the fused image.
The greater the EN, the more information the fused image contains, and EN is one of the most often used indicators of image quality. However, if the fused image contains noise and artifacts, the value of EN rises significantly and no longer reflects the quality of the fused image accurately. IR images in particular tend to contain a lot of noise. Therefore, we believe EN should only be employed as a secondary assessment metric for IR and VI image fusion [48,51,52].
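As a rough illustration, EN can be computed from the normalized gray-level histogram. The sketch below assumes an 8-bit grayscale image stored as an integer array and the usual Shannon form with a base-2 logarithm; the function name `entropy` is hypothetical.

```python
import numpy as np

def entropy(img, levels=256):
    """Shannon entropy (in bits) of the gray-level histogram of an integer image."""
    values = np.asarray(img).astype(np.int64).ravel()
    hist = np.bincount(values, minlength=levels).astype(np.float64)
    p = hist / hist.sum()   # probability of each gray level
    p = p[p > 0]            # treat 0 * log(0) as 0
    return float(-(p * np.log2(p)).sum())
```

A constant image scores 0 bits and an image that uses all 256 levels equally often scores 8 bits. Added noise spreads the histogram and raises the score, which is why EN alone can be misleading.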
(2) Mutual Information (MI) MI measures the amount of information transferred from the source images to the fused image. In information theory, MI denotes the statistical dependence between two random variables [25,53]. For fusion it is defined as $MI = MI_{FA} + MI_{FB}$, with $MI_{FX} = \sum_{x,f} p_{X,F}(x,f) \log_2 \frac{p_{X,F}(x,f)}{p_X(x)\,p_F(f)}$ for $X \in \{A, B\}$, where $p_X$ and $p_F$ are the marginal gray-level histograms of source image X and fused image F, and $p_{X,F}$ is their joint histogram. $MI_{FA}$ and $MI_{FB}$ indicate the amount of information transferred to the fused image from the infrared and visible images, respectively. A high MI value signifies that a lot of information is transferred from the source images to the fused image, which signals good fusion performance.
(3) Peak Signal-To-Noise Ratio (PSNR) PSNR is the ratio of the peak signal power to the noise power in the fused image [54,55]. It is defined as $PSNR = 10 \log_{10} \frac{r^2}{MSE}$, where r denotes the peak value of the fused image and MSE is the mean squared error between the fused image and the source images. A high PSNR value indicates that the fusion process introduces little distortion and that the fused image is close to the source images.
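A histogram-based estimate of MI can be sketched as follows. The code assumes 8-bit grayscale inputs, a base-2 logarithm, and that the fusion score is the sum of the two terms MIFA and MIFB; the function names `mutual_information` and `fusion_mi` are hypothetical.

```python
import numpy as np

def mutual_information(x, f, levels=256):
    """MI (in bits) between two equally sized gray-level images, from their joint histogram."""
    joint, _, _ = np.histogram2d(
        np.asarray(x).ravel(), np.asarray(f).ravel(),
        bins=levels, range=[[0, levels], [0, levels]],
    )
    p_xf = joint / joint.sum()             # joint distribution
    p_x = p_xf.sum(axis=1, keepdims=True)  # marginal of x, shape (levels, 1)
    p_f = p_xf.sum(axis=0, keepdims=True)  # marginal of f, shape (1, levels)
    nz = p_xf > 0                          # skip empty cells (0 * log 0 = 0)
    return float((p_xf[nz] * np.log2(p_xf[nz] / (p_x @ p_f)[nz])).sum())

def fusion_mi(ir, vis, fused):
    """Sum of the information transferred from each source to the fused image."""
    return mutual_information(ir, fused) + mutual_information(vis, fused)
```

The MI of an image with itself equals its entropy, and the MI with a constant image is zero, which gives two quick sanity checks for any implementation.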
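A minimal sketch of PSNR against a single reference image follows. It assumes the common form of 10 times the base-10 logarithm of the squared peak value over the mean squared error, with a peak of 255 for 8-bit data; how the errors against the two source images are combined is left to the caller, and the function name `psnr` is hypothetical.

```python
import numpy as np

def psnr(reference, fused, peak=255.0):
    """Peak signal-to-noise ratio in dB; returns inf for identical images."""
    ref = np.asarray(reference, dtype=np.float64)
    fus = np.asarray(fused, dtype=np.float64)
    mse = np.mean((ref - fus) ** 2)   # mean squared error
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))
```

For example, a uniform error of one gray level on 8-bit data gives an MSE of 1 and therefore a PSNR of 20·log10(255), roughly 48 dB.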

Metrics based on structural similarity
(1) Structural Similarity Index Measure (SSIM) Mathematically, the SSIM between two images U and V is expressed as $SSIM(U,V) = \frac{(2\mu_U\mu_V + C_1)(2\sigma_{UV} + C_2)}{(\mu_U^2 + \mu_V^2 + C_1)(\sigma_U^2 + \sigma_V^2 + C_2)}$, where $\sigma_U^2$ and $\sigma_V^2$ are the variances, $\sigma_{UV}$ is the covariance, $\mu_U$ and $\mu_V$ are the mean intensities, and $C_1$ and $C_2$ are small constants that stabilize the division. SSIM combines the structure, contrast, and luminance distortion between the fused image and the source images by modeling any image distortion as a combination of loss of correlation, radiometric (luminance) distortion, and contrast distortion [56,57].
(2) Mean Squared Error (MSE) MSE measures the error between the ideal (reference) result and the estimated result [58,59]. It is defined as $MSE = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} (A_{ij} - B_{ij})^2$, where m and n are the height and width of the image (the numbers of pixel rows and columns), A and B are the ideal image and the fused image under evaluation, respectively, and i and j are the pixel row and column indexes.
(3) Correlation Coefficient (CC) CC assesses the degree of linear correlation between the fused image and the infrared and visible images [60,61]. For a source image X it is defined as $r_{X,F} = \frac{\sum_{i=1}^{H}\sum_{j=1}^{W} (X(i,j) - \bar{X})(F(i,j) - \bar{F})}{\sqrt{\sum_{i=1}^{H}\sum_{j=1}^{W} (X(i,j) - \bar{X})^2 \, \sum_{i=1}^{H}\sum_{j=1}^{W} (F(i,j) - \bar{F})^2}}$, where X denotes a source image, F the fused image, $\bar{X}$ and $\bar{F}$ their mean values, and H and W the height and width of the image.
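The roles of the means, variances, and covariance in SSIM can be seen in a single-window version computed over the whole image. This is a simplified sketch: practical SSIM implementations average the index over local windows, and the stabilizing constants used here (derived from `k1 = 0.01` and `k2 = 0.03`, the commonly used defaults) are assumptions, as is the function name `global_ssim`.

```python
import numpy as np

def global_ssim(u, v, peak=255.0, k1=0.01, k2=0.03):
    """SSIM evaluated once over the whole image (no sliding window)."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2   # stabilizing constants
    mu_u, mu_v = u.mean(), v.mean()               # mean intensities
    var_u, var_v = u.var(), v.var()               # variances
    cov = ((u - mu_u) * (v - mu_v)).mean()        # covariance
    luminance = (2 * mu_u * mu_v + c1) / (mu_u ** 2 + mu_v ** 2 + c1)
    contrast_structure = (2 * cov + c2) / (var_u + var_v + c2)
    return float(luminance * contrast_structure)
```

An image compared with itself scores 1, while an image compared with its photographic negative has negative covariance and therefore a negative score.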
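The linear correlation between a source image and the fused image is the Pearson coefficient of their pixel values. In the sketch below, averaging the two per-source coefficients into one score is an assumption about how the metric is aggregated, and the function names `correlation` and `fusion_cc` are hypothetical.

```python
import numpy as np

def correlation(x, f):
    """Pearson correlation between two equally sized images.

    Undefined (division by zero) if either image is constant.
    """
    xd = np.asarray(x, dtype=np.float64) - np.mean(x)   # deviations from the mean
    fd = np.asarray(f, dtype=np.float64) - np.mean(f)
    return float((xd * fd).sum() / np.sqrt((xd ** 2).sum() * (fd ** 2).sum()))

def fusion_cc(ir, vis, fused):
    """Assumed aggregation: mean of the two per-source correlations."""
    return 0.5 * (correlation(ir, fused) + correlation(vis, fused))
```

The coefficient is invariant to gain and offset, so a fused image that is a brightened or rescaled copy of a source still scores 1 against that source.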

Metrics based on feature similarity
(1) Average Gradient (AG) The average gradient (AG) metric quantifies the gradient information of the fused image and reflects its detail and texture [49,62,63]. It is defined as $AG = \frac{1}{(M-1)(N-1)} \sum_{k=1}^{M-1} \sum_{l=1}^{N-1} \sqrt{\frac{\nabla F_i^2 + \nabla F_j^2}{2}}$, where $\nabla F_i = F_{k,l} - F_{k+1,l}$, $\nabla F_j = F_{k,l} - F_{k,l+1}$, and M and N are the dimensions of the fused image F in pixels. The greater the average gradient, the more gradient information the image contains and the better the fusion result.
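The average gradient can be sketched directly from the two finite differences named above. The normalization over the (M-1) by (N-1) positions where both differences exist is an assumption, since boundary handling varies between implementations, and the function name `average_gradient` is hypothetical.

```python
import numpy as np

def average_gradient(f):
    """Mean of sqrt((dFi^2 + dFj^2) / 2) over positions where both differences exist."""
    f = np.asarray(f, dtype=np.float64)
    d_i = f[:-1, :-1] - f[1:, :-1]   # F[k, l] - F[k+1, l]
    d_j = f[:-1, :-1] - f[:-1, 1:]   # F[k, l] - F[k, l+1]
    return float(np.mean(np.sqrt((d_i ** 2 + d_j ** 2) / 2.0)))
```

A constant image has an average gradient of 0, and a ramp that rises by one gray level per pixel in both directions has an average gradient of exactly 1.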
(2) Standard Deviation (SD) The standard deviation (SD) metric reflects the distribution and contrast of the fused image [2,64]. It is defined as $SD = \sqrt{\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} (F(i,j) - \mu)^2}$, where μ is the mean value of the fused image F and M and N are its dimensions. Human vision is highly sensitive to contrast and is naturally drawn to high-contrast regions, so a fused image with a larger SD generally looks better.
(3) Spatial Frequency (SF) Spatial frequency indicates the overall activity of an image in the spatial domain. It is built from two parts, the spatial row frequency (RF) and the spatial column frequency (CF): $RF = \sqrt{\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=2}^{N} (F(i,j) - F(i,j-1))^2}$, $CF = \sqrt{\frac{1}{MN} \sum_{i=2}^{M} \sum_{j=1}^{N} (F(i,j) - F(i-1,j))^2}$, and $SF = \sqrt{RF^2 + CF^2}$.
SF reflects both the spatial variation of the image and the sharpness of its details. The larger the SF, the richer the textures and edges. SF also does not require a reference image. However, undesired artifacts in the fused IR and VI image also increase the value of SF, in which case SF cannot accurately reflect the quality of the fused image [14,48,65].
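Spatial frequency can be sketched from horizontal and vertical first differences. The code assumes the common definition in which RF and CF are root-mean-square differences along rows and columns, normalized by the total number of pixels, and SF is the square root of the sum of their squares; the function name `spatial_frequency` is hypothetical.

```python
import numpy as np

def spatial_frequency(f):
    """SF = sqrt(RF^2 + CF^2) from row-wise and column-wise first differences."""
    f = np.asarray(f, dtype=np.float64)
    m, n = f.shape
    rf_sq = ((f[:, 1:] - f[:, :-1]) ** 2).sum() / (m * n)   # row frequency squared
    cf_sq = ((f[1:, :] - f[:-1, :]) ** 2).sum() / (m * n)   # column frequency squared
    return float(np.sqrt(rf_sq + cf_sq))
```

Because the score depends only on local differences, both genuine edges and noise or artifacts increase it, which is the limitation noted above.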
(4) Gradient-Based Fusion Performance ($Q^{AB/F}$) Based on the presumption that the edge information in the source images is preserved in the fused image, $Q^{AB/F}$ assesses the amount of edge information transferred from the source images to the fused image [66,67]. It is defined as $Q^{AB/F} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} [Q^{AF}(i,j)\,w^A(i,j) + Q^{BF}(i,j)\,w^B(i,j)]}{\sum_{i=1}^{N}\sum_{j=1}^{M} [w^A(i,j) + w^B(i,j)]}$, where $Q^{XF}(i,j)$ is the edge preservation value (combining edge strength and orientation) from source image X to the fused image F at pixel (i,j), and $w^X(i,j)$ is the corresponding weight.

Metrics based on source and produced images
(1) Visual Information Fidelity (VIF) This fusion metric is built on the visual information fidelity (VIF) model and is used specifically for image fusion assessment [68]. Han et al. [69] used the VIF model to extract the visual information from the source images and, after additional processing to remove distorted information, integrated the visual information into a single fusion assessment index. The VIF calculation procedure can be summarized in four stages. First, the source images and the fused image are filtered and divided into blocks. Second, each block is checked for distorted visual information. Third, the fidelity of the visual information in each block is computed. Fourth, the overall VIF-based index is determined [69].
(2) Other measures The metrics QCB and QCV are based on human visual perception and gauge the visual quality of fused images. Running time is another important measure of an algorithm's performance: the computational efficiency of a model is assessed by the time the image fusion technique takes to run [47,49,70].

EXPERIMENT
The number of studies and methods in the field of image merging is growing every day.The primary goal is to explore the present issues and potential directions for image fusion as they relate to diverse fields, including surveillance, photography, medical diagnosis, and remote sensing.The following data sets were used in the tests for the visual and infrared image fusion in this field: TNO dataset [71] is a collection of multispectral nighttime images captured by several multiband camera systems in various military-relevant settings.The FLIR dataset offers comparable RGB pictures and annotated thermography datasets.14,452 infrared pictures altogether are included in the collection.The majority of the pairs of photos in the LLVIP collection [72] were captured in extremely dark environments, and each pair is perfectly matched in time and location.The KAIST [73] data collection contains different broad sceneries of a campus, a street, and a rural area.Each image has an associated visual picture and thermal picture.With a spatial resolution of 480 × 640, the infrared and visible picture pairings in the MSRS [74] collection include both daylight and nighttime settings.Table 1 shows performance of some deep learning-based image fusion techniques.Experiments were conducted on 10 pair of images collected from KAIST data set.
Table 2 shows the results of some fusion methods based on deep learning conducted on 21 pair of images collected from TNO dataset.  1 quantify the excellence and efficiency of several image fusion techniques that rely on deep learning.The metrics encompass measurements related to the differentiation in brightness, the level of detail, the distinctness, and the accuracy of the merged pictures.According to the values, the TGFuse approach demonstrates the top scores in most metrics, with NestFuse, and U2Fusion closely following.These findings indicate that TGFuse is the most efficient and resilient technique for image fusion, mainly when applied to the KAIST dataset.According to the information presented in Table 2, the MTNO technique exhibits highest scores in most criteria, with DDcGAN, U2Fusion, IFCNN, and DeepFuse closely trailing behind.The findings suggest that MTNO is the optimal and robust method for image fusion, mainly when used with the TNO dataset.FusionGAN and DenseFuse exhibit inferior performance across all criteria, indicating their diminished efficacy in picture fusion compared to the other approaches.Nevertheless, diverse datasets and settings may necessitate distinct evaluation metrics and criteria contingent upon the aim and application of picture fusion.
The objective experimental findings indicate that TGFuse achieves the highest scores in most criteria on the KAIST dataset, and that MTNO does so on the TNO dataset. Hence, future research should concentrate on investigating and enhancing the TGFuse and MTNO approaches for image fusion. These methods have demonstrated superior efficiency and robustness in their specific contexts, and their continued development through deep learning techniques has produced improved new technologies. A more thorough investigation of the methods and algorithms used in TGFuse and MTNO has the potential to reveal insights that further advance the field of image fusion. It is also worthwhile to examine how well these strategies transfer to other datasets and environments, to confirm their robustness and effectiveness across a wide range of image fusion applications.
While image fusion techniques are making progress, several issues still have no ideal solution, and future work will need to explore and address them. Although several image fusion models based on convolutional neural networks perform well, most of them fall short of perfection. Finally, to fully preserve the feature information extracted at each convolutional layer, CNN-based fusion approaches must focus on improving the flow of features through the intermediate layers of the network.

CONCLUSIONS
This survey, "Deep Learning Methods for Fusion of Infrared and Visible Images: A Survey", summarizes the progress made in image fusion, focusing explicitly on deep learning techniques. It systematically assesses the fusion methods using various image metrics, distinguishing their respective contributions, advantages, and constraints. Furthermore, it defines the current state of research on the fusion of infrared and visible images and provides a framework for possible future research directions.
The survey highlights the notable progress achieved by applying deep learning approaches to infrared and visible image fusion. It emphasizes the enhanced efficacy of image fusion methods, which produce fused images with a variety of possible uses. Nevertheless, the survey recognizes the current difficulties and debates surrounding the application of deep learning in image fusion. It highlights the need for further investigation and innovation to address these intricacies and ensure the continued advancement of this domain.
The survey's assessment of the fusion algorithms, from both objective and subjective standpoints, offers significant insights into the effectiveness of different fusion procedures. This analysis identifies the advantages and constraints of the various methodologies and provides a thorough understanding of their efficacy in varied circumstances. The emphasis on both objective and subjective judgments highlights the need for a complete evaluation methodology to properly measure the quality and efficiency of image fusion techniques.
"Deep Learning Methods for Fusion of Infrared and Visible Images: A Survey" is a helpful resource for scholars and practitioners in image fusion. It demonstrates the progress made possible by deep learning and highlights the unresolved obstacles and the prospective directions for future research. By offering a comprehensive and unbiased evaluation of the present state of the art, the survey supports further improvement and innovation in the field of image fusion.

Figure 4. Auto-encoder-based infrared and visible image fusion [39]

Auto-encoder-based fusion methods, such as the unsupervised auto-encoder network, NestFuse, and RTFNet, have shown promise in image fusion by efficiently extracting features and maintaining data integrity. Nevertheless, they encounter difficulties dealing with multi-modal signals and achieving precise edge recognition. Furthermore, these

Table 1. The performance of some methods on 10 pairs of images from the KAIST dataset; the best two values are indicated in bold and red italic font.

Table 2. The performance of some deep learning-based techniques on 21 pairs of images from the TNO dataset; the best two values are indicated in bold and red italic font.