Advancing Biometric Identity Recognition with Optimized Deep Convolutional Neural Networks

ABSTRACT


INTRODUCTION
Recent advancements in communications, coupled with increasing security demands, have stimulated active research within the biometric community on identity recognition based on physical characteristics. This domain is pivotal for applications such as surveillance, crowd analytics, automated identity checks, and device unlocking. Yet, the construction of a robust identity recognition system necessitates the use of physical traits that are consistent, measurable, and universal [1].
The quest for a unique biometric data representation is crucial for the success of automatic identification systems. This quest has been vigorously explored in facial recognition but remains challenged by uncontrollable image conditions and significant intraclass variations, brought about by changes in illumination, perspective, age, clutter, occlusions, or inconsistent image size or quality. The human ear, akin to the face, presents biometric data that can be utilized for individual identification [2]. Unlike face images, however, ear images are largely unaffected by physical deformations because the ear has no muscle activity. They offer a consistent and recognizable structure that is typically unaffected by movement, speech, or facial expressions. As an added benefit, the visual appearance of the ear is generally unaltered by cosmetic factors such as makeup. Furthermore, capturing ear images is less invasive than acquiring facial images, and in comparison to other biometric modalities such as fingerprinting, ear imaging is a contact-free procedure. For surveillance applications, ear images can be acquired with adequate quality and distinctiveness without requiring the subject's cooperation. These advantages make ear images suitable candidates for incorporation into automated identification systems, or as a complement to facial recognition for improving accuracy on profile images [3].
Early ear recognition systems used a conventional processing pipeline comprising feature extraction and classification. This two-step method dominated identity recognition using ear images, with manual feature extraction followed by standard classifiers [4][5][6][7]. First, textural information or structural descriptors of the ear were exploited to manually craft discriminative features. Subsequently, a linear or nonlinear classifier was applied to predict the identity. Although these techniques have proven successful on small and mid-sized ear datasets, they do not scale to larger datasets. Manual feature extraction is subjective and requires domain expertise, and the extracted features often fall short in capturing complex patterns and relationships in ear images. Moreover, conventional classifiers often underperform when applied to complex datasets with high dimensionality or nonlinear relationships between features. They are also prone to overfitting, particularly when the number of features is large relative to the number of samples in the dataset. Consequently, current trends in ear recognition favor deep learning-based techniques, owing to their scalability and superior recognition performance [8][9][10].
Deep learning has rapidly evolved into a popular data-driven learning approach for various image recognition problems, including object detection [11,12], deepfake detection [13,14], and medical image analysis [15][16][17][18]. It amalgamates the feature extraction and classification processes into a single end-to-end model. Deep Convolutional Neural Networks (CNNs) offer an effective way to utilize deep learning for image recognition. As computers perceive images through pixels, the convolution operation exploits the spatial relationships between neighboring pixels to aid in image identification. An advantage of CNNs over their predecessors is the automatic detection of significant features without human intervention. Training deep neural networks also tunes the representation of the input data to the specific task, contributing to the high adaptability of deep learning techniques. However, this comes at a cost. Training such networks requires a large amount of data to counteract overfitting, a common problem in machine learning where an identification system memorizes the training images rather than learning the underlying relationships in the input data that distinguish between individuals. To mitigate this problem, this study exploits transfer learning, a potent technique in machine learning that involves reusing pretrained models and adapting them to new tasks [19].
In the realm of ear recognition, Abd Almisreb et al. [20] applied transfer learning to AlexNet [21]. A modest collection of 300 ear images from 10 subjects was used to fine-tune the pretrained network, with 250 images allocated for training and 50 images for validation and testing. The fine-tuned network achieved a recognition accuracy nearing 100%. Transfer learning was also employed on the VGG16 model, paired with a Support Vector Machine (SVM), to build a hybrid algorithm for individual identification via ear images [22]. The proposed model was validated on a dataset comprising 2600 ear images with various poses, rotations, and illumination variations, and achieved a recognition accuracy of 98.72%. In another study [23], transfer learning was used with VGG16 and ResNet50 models to accelerate model construction and enhance ear recognition performance. The IIT Delhi ear dataset, consisting of 1286 ear images from 221 subjects, was used to evaluate these models. In these experiments, the VGG16 model outperformed ResNet50, achieving a recognition accuracy of 89.73%. Alejo and Hate [24] examined the use of transfer learning to tackle the challenge of unconstrained ear recognition. Eight different pretrained CNN models were explored, and their performances were compared on a dataset of 250 ear images from 10 subjects, sourced from the Internet. The models achieved a recognition accuracy above 95%. In this study, we apply transfer learning methods to enhance the accuracy of ear recognition models by leveraging models pretrained on related image recognition tasks. One approach to using transfer learning for ear recognition is to start from models pretrained on similar visual recognition tasks, such as face recognition or object recognition. These models can then be fine-tuned with ear images to improve their performance on the ear recognition task. This approach is particularly useful when there is limited labeled ear image data available for training the models from scratch [25][26][27].
The main contributions of this work can be summarized as follows:
• We explore and compare transfer learning and fine-tuning of deep CNN architectures to aid in the design process of robust ear identification systems.
• Various deep CNN architectures are evaluated for their robustness in recognizing individuals from their ear images.
• The proposed models are compared against a range of deep learning-based recognition methods on two publicly available benchmark datasets, aiming to enhance the overall recognition accuracy.
• We provide visual explanations for a better understanding of the proposed models, offering insights into which parts of the ear image are most critical for recognition decisions.
The remainder of the paper is organized as follows: Section 2 discusses the related previous work. Section 3 describes the different CNN architectures employed in this work. Section 4 outlines our methodology for building robust ear recognition models. The experimental settings and results are presented in Section 5. Finally, Section 6 concludes the paper and outlines future research directions.

RELATED WORK
The potential of the human ear as a biometric modality for identity recognition has been the focus of numerous research studies [28][29][30]. Previous studies have demonstrated that ear recognition, using traditional machine learning methods, can achieve satisfactory recognition rates with carefully crafted features [31][32][33]. However, these methods often prove sensitive to noise, illumination, and pose variations, and may struggle to capture the subtle differences among ear images. To overcome these limitations, deep learning-based ear recognition methods have been proposed. Researchers are exploring the potential of deep network architectures such as CNNs, together with transfer learning methods, to enhance performance. Because ear image datasets are relatively small and overfitting is a concern, deep learning techniques have only recently been applied to ear recognition tasks. Strategies such as aggressive data augmentation, model size reduction, regularization techniques, or transfer learning using models pretrained on large datasets like ImageNet are crucial to work within these constraints. The most significant challenge for current ear recognition algorithms is the transition from controlled to uncontrolled imaging conditions, a difficulty typically attributed to the lack of large-scale ear datasets encompassing highly variable ear images.
Recently, the use of deep neural networks for ear recognition has led to notable improvements in recognition performance compared to traditional methods. Emeršič et al. [34] examined the challenge of training deep CNN-based models using a limited number of ear images. They explored three different network architectures and various approaches to train models, employing different degrees of image augmentation, up to 100 times the original training images. A combined dataset of 2304 ear images from 166 subjects was used for model training and evaluation. The top-performing models demonstrated the capacity to automatically learn discriminative features from raw ear images, achieving a recognition accuracy of 62%. In another study [35], the authors conducted an experimental exploration of ear recognition using deep architectures of increasing depth, specifically the VGG models [36]. The AMI and WPUT ear datasets were utilized for the experiments, with three different network training strategies tested. Fine-tuning emerged as the most effective strategy, achieving rank-1 recognition accuracies of 96.78% and 74.36% on the AMI and WPUT datasets, respectively. To further enhance accuracy, several model combinations were ensembled. In a similar study [27], the authors experimented with deep residual networks (ResNet [37]) of differing depths. Fine-tuning strategies were employed to achieve improved performance. Four ear datasets, containing ear images captured under both constrained and unconstrained imaging conditions, were used to evaluate the models. The proposed models achieved rank-1 recognition accuracy ranging from 67% to 99% for ear images taken under unconstrained and constrained conditions, respectively. Moreover, the best results were obtained using an ensemble of several fine-tuned ResNet models of varying depth.
To assess the effectiveness of ear recognition technology on a substantial ear dataset, the inaugural Unconstrained Ear Recognition Challenge (UERC) [21] was held in 2017. For model evaluation, the challenge considered tightly cropped ear images that exhibited various head movements (poses), lighting changes, image resolution variations, and occlusions. The ear recognition methods submitted for the challenge were evaluated and analyzed to investigate their capacity to handle these diverse variations in ear images. The study shed light on some key findings, including the need for significant performance improvements prior to deploying ear recognition technology in unconstrained environments. Additionally, the experiments revealed that the evaluated methods were sensitive to changes in head pose.
The authors of [7] proposed a framework to tackle the unconstrained ear recognition problem using multiple ear image datasets. They leveraged CNN-based models for ear normalization and description, combined them with a selected set of handcrafted image descriptors, and then fused the handcrafted and CNN-based features. A variety of feature combinations were tested, and substantial improvements in recognition performance were reported when both feature types were combined.
The authors of [26] proposed a two-stage domain adaptation strategy for fine-tuning deep CNN-based models to address the unconstrained ear recognition problem. They conducted a thorough analysis of the impact of several crucial factors on ear recognition performance, such as dataset bias, illumination, aspect ratio, data augmentation, and alignment.
Alshazly et al. [38] conducted an extensive study on unconstrained ear recognition. They employed different transfer learning strategies using well-known deep CNN architectures, including AlexNet, VGG, Inception [39], ResNet, and ResNeXt [40], to overcome the problem of inadequate data for training deep CNNs from scratch. The experiments were conducted on the EarVN1.0 dataset [41], which comprises 28,412 ear images from 164 subjects obtained from the web. The results indicated a recognition performance above 93% when fine-tuning pretrained deep CNN architectures with custom-sized inputs that preserve the aspect ratio of the images in the EarVN1.0 dataset.
Khaldi et al. [42] proposed a deep unsupervised active learning (DUAL) strategy for ear recognition based on a pretrained VGG16 model. The training process was divided into two stages: an initial supervised training stage using a classification model, followed by an unsupervised active learning stage. Three ear image datasets, comprising ear images captured under both constrained and unconstrained imaging conditions, were used for training and performance evaluation. The proposed technique was reported to yield a significant improvement in the recognition rate.
Our study builds upon previous research on unconstrained ear recognition and reports the results of experiments conducted on two challenging ear image datasets. Further, we evaluated the performance of new deep CNN models (DenseNet and MobileNet) in ear identification. Given the limited quantity of ear images in the datasets considered, we proposed a transfer learning approach to fine-tune the pretrained models on ear images, aided by data augmentation, to achieve improved recognition performance. We provided a detailed comparative analysis using various performance evaluation metrics and reported the results achieved by each model. Owing to the black-box nature of deep models and in an effort to make them more transparent, we applied the Grad-CAM visualization technique. Grad-CAM provides visual explanations, highlighting the significant ear image regions that are frequently considered by the models for making accurate predictions.

DEEP ARCHITECTURES
Deep CNN architectures are a type of deep learning model particularly well-suited for image processing and recognition tasks. Over the years, CNN architectures have evolved and many variants have been developed, resulting in substantial advancements in the growing field of deep learning. New architectures are typically developed to improve accuracy while addressing problems related to computational efficiency, error rate, and vanishing or exploding gradients. This section provides a brief description of the deep CNN architectures employed in our study to construct the ear recognition system. Moreover, it highlights the diverse building blocks and approaches used to build these deep architectures.

ResNet
A residual network (ResNet) is a deep neural network architecture that was introduced by He et al. [37] in 2015. It is a modification of the traditional CNN architecture that overcomes the problem of vanishing gradients that occurs in very deep networks. The ResNet architecture introduces the concept of residual learning, which involves the addition of shortcut connections that skip over one or more layers in the network. These shortcut connections enable the gradient to flow directly through the network, which makes very deep networks easier to optimize. The ResNet architecture is characterized by a series of residual blocks that contain several convolutional layers and shortcut connections. The residual blocks allow the network to learn more complex features from the data and also enable it to be trained more efficiently. In addition, the ResNet architecture includes batch normalization and ReLU activation functions, which further improve the network performance.
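As a minimal sketch of the residual idea described above (not the implementation used in our experiments), the following NumPy snippet computes y = ReLU(F(x) + x). For brevity it assumes dense layers in place of convolutions and omits batch normalization; the function and variable names are illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block: y = relu(F(x) + x).

    F is two linear maps with a ReLU in between; dense layers stand in
    for the convolutions of a real ResNet block (an assumption made to
    keep the sketch short).
    """
    fx = relu(x @ w1) @ w2   # residual branch F(x)
    return relu(fx + x)      # shortcut connection adds the input back

rng = np.random.default_rng(0)
x = relu(rng.normal(size=(4, 8)))   # non-negative input, as if it followed a ReLU
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = np.zeros((8, 8))               # residual branch switched off

# With F(x) = 0 the block reduces to the identity mapping on x.
y = residual_block(x, w1, w2)
```

When the residual branch contributes nothing, the block passes its input through unchanged, which is why stacking many such blocks does not by itself make the network harder to train.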

DenseNet
A densely connected network (DenseNet) is a deep network architecture that was introduced by Huang et al. [43] in 2017. It is a modification of the traditional CNN architecture that improves the efficiency of information flow between layers. The DenseNet architecture introduces the concept of dense connectivity, which involves connecting each layer to every subsequent layer in a feedforward fashion. The cornerstone of the architecture is the dense block, which enables the network to learn complex feature representations from the input and effectively reuse them across the network. The dense connectivity pattern ensures that each layer in the block has access to all previously learned features, and it reduces the number of parameters in the network, making it more computationally efficient. The DenseNet architecture is constructed using a series of dense blocks that contain several convolutional layers, batch normalization, and ReLU activation functions. In each dense block, the output of each layer is concatenated with the outputs of all previous layers in the block.
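The effect of concatenation on channel counts can be illustrated with a short calculation. The sketch below assumes that every layer in a dense block adds a fixed number of feature maps (the growth rate); the block size and channel values are illustrative and the helper names are hypothetical.

```python
def dense_block_input_channels(k0, growth_rate, num_layers):
    """Number of input channels seen by each layer of a dense block.

    Layer i receives the block input (k0 channels) concatenated with the
    outputs of the i preceding layers (growth_rate channels each).
    """
    return [k0 + growth_rate * i for i in range(num_layers)]

def dense_block_output_channels(k0, growth_rate, num_layers):
    """Channels leaving the block after all layer outputs are concatenated."""
    return k0 + growth_rate * num_layers

# Illustrative values: 64 input channels, growth rate 32, 6 layers.
inputs = dense_block_input_channels(64, 32, 6)
outputs = dense_block_output_channels(64, 32, 6)
print(inputs)   # [64, 96, 128, 160, 192, 224]
print(outputs)  # 256
```

Because each layer only has to produce a small number of new feature maps, the individual layers can stay narrow even though later layers see a wide concatenated input.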

InceptionV3
InceptionV3 is a deep convolutional neural network architecture introduced by Google in 2015 as an extension to the Inception family of models [39]. It is a deep learning architecture designed for image classification and object detection tasks. The InceptionV3 architecture consists of multiple modules, with each module containing a combination of different convolutional layers. The main idea behind the InceptionV3 architecture is to use multiple filters of different sizes, which allows the network to capture features at different scales. This is achieved using a combination of 1×1, 3×3, and 5×5 convolutional filters within the same layer. Additionally, the architecture also incorporates the use of pooling and batch normalization layers to improve training stability and accuracy. The cornerstone of the InceptionV3 architecture is the Inception module, which consists of a combination of different convolutional layers. The Inception module is designed to maximize the use of computational resources by using parallel convolutional layers with different filter sizes. This enables the network to capture features at different scales and helps reduce the number of parameters in the network.

MobileNet
MobileNet is a convolutional neural network architecture introduced by Google in 2017 that is optimized for mobile and embedded devices [44]. The main objective of MobileNet is to provide high accuracy with low computational cost and a low memory footprint, making it suitable for deployment on mobile devices with limited resources. The MobileNet architecture is built from depthwise separable convolutions, which decompose the standard convolutional operation into two separate layers: a depthwise and a pointwise convolution. The depthwise convolution applies a single filter to each input channel, producing one output channel per input channel. These channels are then combined using a pointwise (1×1) convolution. This approach significantly reduces the number of parameters in the network while maintaining high accuracy. The later MobileNetV2 variant additionally uses linear bottlenecks, in which a 1×1 convolution expands the number of channels before the depthwise convolution and a linear 1×1 convolution then projects them back to a narrow representation. This technique reduces the computational cost of the network while preserving its accuracy.
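The parameter saving of a depthwise separable convolution can be checked with simple arithmetic. The sketch below counts weights only (biases are ignored) for an illustrative 3×3 layer with 128 input and 256 output channels; these sizes are assumptions chosen for the example, not values taken from our models.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise

std = standard_conv_params(3, 128, 256)
sep = depthwise_separable_params(3, 128, 256)
ratio = sep / std

print(std)  # 294912
print(sep)  # 33920
```

The ratio equals 1/c_out + 1/k², so for 3×3 filters the separable form needs roughly one ninth of the weights of the standard convolution.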

PROPOSED METHODOLOGY
This section describes the proposed framework for ear recognition. As discussed in the studies [45,46], the learned features of a deep CNN trained on large image datasets are highly transferable to other vision tasks and datasets. The more similar the pretraining and target tasks are, the more effective the transfer becomes. Nevertheless, transferring learned features, even from a distant task, is superior to learning them from scratch on the target dataset.

Transfer learning
Transfer learning is a machine learning approach in which a pretrained model is used as the starting point for a model on a new task. In the context of deep neural networks, two transfer learning methods are commonly used: fine-tuning and feature extraction. The process of fine-tuning involves the following steps:
1. Load the pretrained model: Load the weights of a pretrained model, such as a deep CNN, from a publicly available source or from a model trained in-house.
2. Replace the last layer: Replace the last layer of the pretrained model with a new layer that has the number of outputs required for the new task. For example, the pretrained models used here were trained for image classification with 1,000 categories, whereas our ear datasets have 100 and 474 subjects, so the last layer needs to be replaced with a new layer that has 100 and 474 outputs, respectively.
3. Freeze some layers: Freeze some of the earlier layers in the pretrained model to prevent them from being updated during training. The earlier layers have learned more general features, such as edges and curves, that are likely to be useful for the ear recognition task as they are.

Ear recognition framework
The ear recognition framework is based on fine-tuning pretrained deep CNNs. The models include MobileNet, ResNet, Inception, and DenseNet, all pretrained on the ImageNet dataset. The final layers of these models are fully-connected (FC) layers with a softmax classifier. In fine-tuning, we remove the original set of FC layers and add a new set of FC layers on top of the original architecture. These new FC layers can then be trained and adjusted to the specific ear image dataset. Usually, the newly added FC layers have fewer parameters than the original ones; however, this depends on the particular dataset. The new FC layers are randomly initialized and connected to the body of the original network. However, if we start training the entire network immediately, we risk modifying the already learned, discriminative filters of the convolutional layers. Because the new FC layers start from random values, gradients backpropagated from them through the network can destroy these discriminative features. To circumvent this, we freeze all layers in the body of the network and allow only the newly attached layers to be adjusted. The network forward propagates the training data, but backpropagation is stopped after the FC layers, allowing the new layers to begin learning patterns from the discriminative convolutional features. Training is then allowed to continue until sufficient accuracy is obtained. Figure 1 illustrates the schematic diagram of the fine-tuning process.
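The freeze-and-train-the-head procedure can be illustrated with a toy NumPy example. This is only a sketch under strong simplifying assumptions: a fixed random projection (`W_body`) stands in for the pretrained convolutional body, the data are synthetic, and the new head is a single softmax layer trained by plain gradient descent. None of the names or hyperparameters below come from our experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pretrained body (assumption): fixed projection + ReLU.
W_body = rng.normal(size=(20, 16)) / np.sqrt(20)
W_body_before = W_body.copy()

def body(x):
    return np.maximum(x @ W_body, 0.0)

# Synthetic 3-class data in place of ear images.
n_classes = 3
labels = np.repeat(np.arange(n_classes), 30)
class_means = rng.normal(size=(n_classes, 20))
X = rng.normal(size=(90, 20)) + class_means[labels]

# Newly attached, randomly initialized classification head.
W_head = rng.normal(size=(16, n_classes)) * 0.01
b_head = np.zeros(n_classes)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grads(feats, y):
    """Cross-entropy loss and its gradients with respect to the head only."""
    probs = softmax(feats @ W_head + b_head)
    idx = np.arange(len(y))
    loss = -np.log(probs[idx, y]).mean()
    d = probs.copy()
    d[idx, y] -= 1.0
    d /= len(y)
    return loss, feats.T @ d, d.sum(axis=0)

# Forward pass through the frozen body; no gradient is computed for W_body.
feats = body(X)
initial_loss, _, _ = loss_and_grads(feats, labels)
for _ in range(200):
    _, grad_W, grad_b = loss_and_grads(feats, labels)
    W_head -= 0.05 * grad_W   # only the head is updated
    b_head -= 0.05 * grad_b
final_loss, _, _ = loss_and_grads(feats, labels)
```

In a deep learning framework, the same effect is typically obtained by marking the body's layers as non-trainable before compiling or by excluding their parameters from the optimizer.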

EXPERIMENTAL SETUP
This section discusses the settings of the conducted experiments. The ear datasets, model settings, preprocessing steps (including data augmentation), and evaluation metrics are described in the following subsections.

Datasets
Human ears have various textures, colors, and shapes, which makes ear images distinctive for each person and allows identity prediction. Moreover, occlusions and changes in illumination, perspective, and resolution are additional variables that contribute to the diversity of ear images.
Generally, the selection of the ear dataset and the degree to which these variations are present and controlled have a significant impact on the overall difficulty of identity prediction. Although there are numerous ear datasets, the number of images they offer is still limited. Because there are few images to account for the significant intraclass variation, uncontrolled datasets pose a problem for deep learning. Two representative datasets were used to train and evaluate our proposed models. The first dataset is the Mathematical Analysis of Images (AMI) dataset [47], which contains 700 ear images acquired from 100 subjects, where each subject has six images of the right ear and a single image of the left ear. The second dataset is the West Pomeranian University of Technology (WPUT) dataset [48], which contains 1960 cropped ear images taken from 474 subjects, where each subject has between 4 and 8 images.
Example images from the (a) AMI and (b) WPUT datasets are shown in Figure 2. Moreover, a summarized description of these datasets is given in Table 1.

Experimental settings
We divided each dataset into two disjoint sets, a training set and a test set, containing 60% and 40% of the ear images, respectively. The training set is used to adjust the weights of the various networks, while the test set is used to evaluate the models and report the results. Table 2 describes the different setup configurations for each considered deep CNN architecture on the used ear datasets. The CNN models are trained for 100 to 250 epochs with a batch size of 16. The ReLU activation function is used in the newly added layers. The adaptive moment estimation (Adam) optimizer is used for all models. For the AMI dataset, all models are trained for 250 epochs except DenseNet201, which converges after 100 epochs. For the WPUT dataset, all models are trained for 250 epochs except InceptionV3, which requires 200 epochs to converge. Convergence here denotes the point in training after which further weight updates bring little additional improvement and the model's error has reached an acceptably low level. All models were trained until convergence on a machine with an Intel(R) Core(TM) i7 CPU, 16 GB RAM, and an Nvidia RTX GPU.
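One way to obtain such disjoint sets is sketched below. It assumes the split is made per subject so that every identity appears in both sets; the text above does not specify this detail, so the per-subject strategy, the function name, and the sample identifiers are assumptions for illustration.

```python
import random
from collections import defaultdict

def split_per_subject(samples, train_fraction=0.6, seed=0):
    """Split (image_id, subject_id) pairs into disjoint train and test lists.

    The split is done within each subject (an assumption), so every
    subject contributes images to both sets.
    """
    by_subject = defaultdict(list)
    for image_id, subject_id in samples:
        by_subject[subject_id].append(image_id)

    rng = random.Random(seed)
    train, test = [], []
    for subject_id, image_ids in sorted(by_subject.items()):
        rng.shuffle(image_ids)
        n_train = max(1, round(train_fraction * len(image_ids)))
        train += [(i, subject_id) for i in image_ids[:n_train]]
        test += [(i, subject_id) for i in image_ids[n_train:]]
    return train, test

# AMI-like layout: 100 subjects with 7 images each (hypothetical identifiers).
samples = [(f"s{s:03d}_{i}", s) for s in range(100) for i in range(7)]
train, test = split_per_subject(samples)
print(len(train), len(test))  # 400 300
```

With seven images per subject, a per-subject split can only approximate the 60/40 ratio (four training and three test images per subject in this sketch).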

Data augmentation
Data augmentation is a technique used in deep learning to increase the size of a training dataset by creating new variations of the existing data. This technique is particularly useful when the available dataset is small and the model must generalize well to unseen data. The process of data augmentation involves creating new training examples by applying various transformations to the original data. These transformations may include:
1. Flipping: flipping the image horizontally or vertically.
2. Rotation: rotating the image by a certain angle.
3. Zooming: cropping and scaling the image to create a zoomed-in or zoomed-out version.
4. Translation: shifting the image horizontally or vertically.
5. Brightness adjustment: increasing or decreasing the brightness of the image.
6. Contrast adjustment: increasing or decreasing the contrast of the image.
7. Adding noise: adding random noise to the image.
8. Shearing: distorting the image by shearing it in one or more directions.
These transformations create new variations of the original image that can help the model generalize better. For example, flipping an ear image horizontally creates a mirrored ear image of the same person, which can help the model become less sensitive to ear orientation. For our experiments, we applied simple transformations: horizontal and vertical flipping, and height and width shifts within a small range (0.3 for the AMI dataset and 0.2 for the WPUT dataset).
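The flips and shifts described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the augmentation code used in our experiments: it assumes the shift range is a fraction of the image height and width, applies each flip with probability 0.5, and fills vacated pixels with zeros; the function name is hypothetical.

```python
import numpy as np

def random_flip_and_shift(image, shift_range, rng):
    """Randomly flip an H x W (x C) image and shift it along height and width.

    shift_range is the maximum shift as a fraction of the image size.
    Pixels exposed by the shift are filled with zeros (an assumption).
    """
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :]   # vertical flip

    h, w = out.shape[:2]
    dy = int(round(rng.uniform(-shift_range, shift_range) * h))
    dx = int(round(rng.uniform(-shift_range, shift_range) * w))

    shifted = np.zeros_like(out)
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x] = out[src_y, src_x]
    return shifted

rng = np.random.default_rng(0)
image = np.arange(1, 49, dtype=float).reshape(6, 8)   # toy 6 x 8 "image"
augmented = random_flip_and_shift(image, 0.3, rng)     # range used for AMI
flip_only = random_flip_and_shift(image, 0.0, rng)     # no shift, flips only
```

Calling the function repeatedly on the same training image yields a different variant each time, which is how the effective training set is enlarged.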

Performance evaluation
Evaluation metrics for an ear recognition system can include accuracy, precision, recall, and F1-score. They are defined below, where TP, TN, FP, and FN refer to true positives, true negatives, false positives, and false negatives, respectively.
1. Accuracy: The proportion of correctly classified ear images over the total number of images.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
2. Precision: The proportion of correctly classified positive cases (ear images) over the total number of positive cases predicted by the system.
$$\text{Precision} = \frac{TP}{TP + FP}$$
3. Recall (or sensitivity): The proportion of correctly classified positive cases (ear images) over the total number of positive cases in the dataset.
$$\text{Recall} = \frac{TP}{TP + FN}$$
4. F1-score: The harmonic mean of precision and recall.
$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
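For a multi-class problem such as subject identification, these metrics are usually computed per class and then averaged. The sketch below assumes macro-averaging over subjects, which is one common choice and not necessarily the averaging used in our tables; the function names and toy labels are illustrative.

```python
def per_class_counts(y_true, y_pred, cls):
    """True positives, false positives, and false negatives for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp, fp, fn = per_class_counts(y_true, y_pred, cls)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    n = len(classes)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example with three subjects and two images each.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy, precision, recall, f1 = macro_metrics(y_true, y_pred)
```

In this toy example four of the six images are classified correctly, so the accuracy is 4/6, while the macro-averaged precision (13/18) is higher than the recall (2/3) because one subject is never predicted incorrectly.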

Experimental results
In an ear recognition system, accuracy and loss curves are used to evaluate the performance of the system during training and testing. The accuracy curve shows how well the recognition system classifies its input and whether its overall performance is improving or deteriorating over time. The loss curve indicates how well the system minimizes the difference between predicted and true outputs. Figure 3 shows the accuracy and loss curves for the CNN models when training and validating on the AMI ear dataset. It can be seen from the accuracy curves that the performance of the models increases with time, indicating that they are learning. We also observe that they improve rapidly at the beginning and then reach a plateau, after which little further learning takes place. The loss curves, in turn, track the models' error over time. Despite slight fluctuations, the loss decreases in the long run, indicating that the models are improving and learning. Table 3 summarizes the obtained results for the different CNN models on the AMI ear dataset. As can be seen from the table, all models obtain accuracy, precision, and F1-score above 92% and a recall score of 94%. In addition, the highest performance with respect to all evaluation metrics was achieved by the DenseNet variants.

Figure 4 illustrates the accuracy and loss curves for the CNN models when training and validating on the WPUT ear dataset. Similarly, the performance of the models increases with time, indicating that the models are learning. We also observe that they improve at the beginning, but over time they reach a plateau. The gap between the training and validation curves is somewhat wider than for the AMI dataset, owing to the wide variations encountered in the WPUT dataset. Moreover, the loss curves show slight fluctuations at the beginning of the learning process; however, over time, the loss gets smaller, indicating that the models are learning.
Table 4 presents the results for the different CNN models on the WPUT ear dataset. From the table, we observe that the best performance with respect to all evaluation metrics is again achieved by the DenseNet architecture.

Grad-CAM visualization
In recognition systems, it is crucial to comprehend and interpret the performance and logic behind network decisions. In this section, we interpret what the CNN models have learned by highlighting the regions that the models consider for prediction. The main goal is to check whether the models trained for ear recognition actually focus on the ears or whether they also use auxiliary textures or details such as hair or skin. This can be examined with the help of the Grad-CAM (Gradient-weighted Class Activation Mapping) [49] visualization technique. Grad-CAM highlights image regions that strongly contribute to a specific decision by providing a heatmap. The Grad-CAM algorithm uses the gradients of the output class score with respect to the feature maps of the final convolutional layer of the CNN to generate a heatmap that indicates the importance of each region in the input image. By visualizing the Grad-CAM heatmaps, we can gain insights into which parts of the ear image are most important for the recognition decision. For example, we may find that the network focuses on specific regions of the ear, such as the helix or the lobule, that contain distinctive features for different individuals. This can help us design more effective ear recognition systems by focusing on the most informative regions of the ear and improving the network's ability to capture these features.
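The core Grad-CAM computation can be sketched in a few lines of NumPy. The sketch assumes that the feature maps of the final convolutional layer and the gradients of the class score with respect to them have already been obtained from the network (how to extract them depends on the framework); the function name and the toy arrays are illustrative.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from final-layer feature maps and their gradients.

    feature_maps: array of shape (K, H, W), activations of the last conv layer.
    gradients:    array of shape (K, H, W), d(class score) / d(feature_maps).
    """
    weights = gradients.mean(axis=(1, 2))               # one importance weight per channel
    cam = np.tensordot(weights, feature_maps, axes=1)   # weighted sum over channels -> (H, W)
    cam = np.maximum(cam, 0.0)                          # keep only positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                           # scale to [0, 1] for display
    return cam

# Toy example: channel 0 supports the class, channel 1 argues against it.
feature_maps = np.array([[[1.0, 0.0], [0.0, 0.0]],
                         [[0.0, 0.0], [0.0, 1.0]]])
gradients = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
heatmap = grad_cam(feature_maps, gradients)
```

Only the location activated by the positively weighted channel remains in the heatmap. In practice the low-resolution map is upsampled to the input size and overlaid on the ear image.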
To apply Grad-CAM to ear recognition, we first trained the CNN models on the ear image datasets and then used Grad-CAM to generate heatmaps that highlight the regions of the ear image that contribute the most to the network's decision. Tables 5 and 6 illustrate Grad-CAM visualizations for various ear images from the AMI and WPUT datasets for which the models correctly identified the subjects. We also show some misclassified ear images to gain insights into false predictions. It can be seen that the models make correct recognition decisions when they concentrate on the geometrical structure of the ear as the most discriminative region. However, when the models focus on textures at the ear boundary or on the hairstyle, the identification decision is wrong.

CONCLUSION
This work presented an ear recognition system based on state-of-the-art deep learning models. Different deep CNN architectures are utilized to improve the previous state-of-the-art results on the AMI and WPUT ear datasets. Because the number of available ear images is too small to train the models from scratch, we adopted a transfer learning strategy and fine-tuned a set of pretrained deep architectures. The DenseNet architecture yielded the highest recognition rate on both the AMI and WPUT datasets. To increase the interpretability of the proposed models and gain some insights into what the models have learned, Grad-CAM visualizations are provided, which highlight the important ear image regions the models consider for prediction. The models emphasize the pinna when making correct decisions. Conveniently, the models make correct predictions when focusing on the geometric structure of the ear, even though they are not constrained to use only these features. Overall, using deep neural networks for feature learning enables automatic and robust feature extraction from raw ear images and can capture subtle differences among individuals. However, there are still some challenges and limitations in this field, such as the lack of large-scale annotated datasets, the vulnerability to adversarial attacks, and the generalization to unseen domains. In our future research, we will focus on further improving the recognition performance on the considered ear datasets, especially the WPUT dataset. This will be addressed by exploring different learning strategies and building specific and more effective deep CNN models. Moreover, we plan to address the problem of unconstrained ear recognition using large-scale ear image datasets.
Fine-tuning is a technique used in deep learning to adapt a pretrained model for a new task or domain. Pretrained models are trained on large datasets and can often learn general features useful for other tasks beyond their original training objectives. Fine-tuning involves taking a pretrained model and training it on a new task or dataset, often with a smaller amount of training data than the original training data. This approach can significantly improve the performance of a model on a new task, while reducing the amount of data and training time required compared to training a new model from scratch.

4. Train the model: Train the modified model on the ear recognition task or dataset, often with a smaller learning rate than was used for pretraining, to avoid overfitting.
5. Fine-tune the model: Gradually unfreeze more layers in the pretrained model and continue training until the model achieves the desired performance on the new task.
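The gradual unfreezing in steps 4 and 5 can be illustrated with a small planning routine. This is a sketch under stated assumptions rather than the training procedure used in this work: the layer names, the learning rates, the `Stage` class, and the function `plan_fine_tuning` are all hypothetical, and a real pipeline would apply each stage by toggling the trainable flags of the framework's layers before resuming training.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    trainable: list        # names of the layers updated in this stage
    learning_rate: float   # learning rate used in this stage


def plan_fine_tuning(backbone_layers, head_layers, layers_per_stage,
                     head_lr=1e-3, fine_tune_lr=1e-5):
    """Plan a head-only stage, then stages that unfreeze the pretrained
    backbone from its last layers toward its first layers."""
    if layers_per_stage < 1:
        raise ValueError("layers_per_stage must be at least 1")

    # Stage 0: backbone fully frozen, only the new classifier head trains.
    stages = [Stage(trainable=list(head_layers), learning_rate=head_lr)]

    unfrozen = 0
    while unfrozen < len(backbone_layers):
        # Unfreeze a few more backbone layers, starting from the top.
        unfrozen = min(unfrozen + layers_per_stage, len(backbone_layers))
        top = backbone_layers[len(backbone_layers) - unfrozen:]
        stages.append(Stage(trainable=list(top) + list(head_layers),
                            learning_rate=fine_tune_lr))
    return stages


# Hypothetical four-layer backbone with a single new classifier layer.
plan = plan_fine_tuning(["conv1", "block1", "block2", "block3"],
                        ["classifier"], layers_per_stage=2)
```

The plan lists every possible stage; training would stop at whichever stage first reaches the desired validation performance, so the earliest backbone layers may never need to be unfrozen.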

Figure 1. Schematic flow of fine-tuning pretrained CNN models on ear images from the WPUT ear dataset

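Figure 3. Accuracy and loss curves for the different CNN architectures on the AMI ear dataset (recovered panel titles: DenseNet121, InceptionV3, ResNet101V2)

Figure 4. Accuracy and loss curves for the different CNN architectures on the WPUT ear dataset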

Table 1. Description of ear datasets

Table 2. Parameter configuration for the deep CNN models on two ear datasets

Table 3. Results obtained from different deep CNN models on AMI dataset

Table 4. Results obtained from different deep CNN models on WPUT dataset

Table 5. Grad-CAM localization to illustrate the important ear image regions for making a recognition decision on the AMI dataset

Table 6. Grad-CAM localization to illustrate the important ear image regions for making a recognition decision on the WPUT dataset