Visual Object Tracking Using Siam and TensorFlow as Hybrid Model in AI

Utkarsh Dubey*, Raju Barskar

Department of CSE, University Institute of Technology, RGPV Bhopal 462033, India

Corresponding Author Email: utkarshdubey7@gmail.com

Pages: 2401-2416 | DOI: https://doi.org/10.18280/ts.420447

Received: 30 December 2024 | Revised: 28 May 2025 | Accepted: 15 June 2025 | Available online: 14 August 2025

© 2025 The authors. This article is published by IIETA and is licensed under the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).

Abstract: 

Tracking objects under diverse challenges is a central problem in computer vision and machine learning, and it is difficult to handle every challenge with a high level of precision across all frames. A carefully designed model is therefore required for reliable visual object tracking, one that captures the relevant patterns in each frame. This research uses a Siamese Network together with TensorFlow to build and train such a model. A Siamese Network contains two or more identical sub-networks that compare inputs and thereby make more precise decisions. Anchors are pipelined with positive and negative samples so that the model operates as a hybrid while processing the corresponding image. TensorFlow gathers the patterns of the objects and recognizes them so that the target is retained until the last frame. The hypothesis is tested on several benchmarks, including OTB50, OTB100 and TempleColor128, where it attains a better level of precision.

Keywords: 

visual object tracking, Siamese Network, TensorFlow, machine learning, OTB50, OTB100, TempleColor128

1. Introduction

Visual object tracking is a challenging task in computer vision because it involves following objects of interest using pattern recognition. The goal of object tracking is to estimate the trajectory and position of the object while facing various challenges such as variations in lighting conditions, fast motion, and obstacles occluding the target. A tracking algorithm typically initializes the target object in the very first frame with its location, using coordinates and other pertinent properties, and then follows it continually in subsequent frames. It also includes feature extraction, object visualization, pattern following and model prediction. Numerous fields, such as robotics, autonomous navigation, augmented reality, video analysis, surveillance, and human-computer interaction, use visual object tracking extensively [1-3].

It may be used, for example, to guide unmanned aerial vehicles, monitor people and objects in congested areas, analyze sporting events, improve immersive gaming, and provide assistive devices for the blind. Visual object tracking has come a long way in recent years, but it remains a difficult challenge since real-world scenarios are inherently complex and require balancing accuracy, efficiency, and resilience. In order to push the limits of object tracking, researchers continue to create and improve cutting-edge tracking algorithms by utilizing developments in deep learning, probabilistic modeling, optimization strategies, and sensor technologies [4].

Several methods are used in visual object tracking to precisely track and identify objects throughout a series of frames in a video. Feature extraction techniques, such as appearance descriptors or keypoints, are commonly employed in these procedures to extract unique attributes from the target object. Motion estimation techniques compute the movement of an object between frames, enabling its position to be predicted. To capture the object's spatial extent, object representation methods like bounding boxes or pixel-wise masks are used. Similarity metrics assess how the object's appearance or composition differs between frames. To ensure stable and dependable tracking performance, tracking algorithms frequently include techniques for managing occlusions, scale changes, and other difficulties that are frequently encountered in real-world circumstances [5]. Even a brief inspection reveals several difficulties that arise when tracking an object using pattern recognition or the features present in the image. It is also necessary to identify the background of the image so that the foreground can be properly segmented for better precision and accuracy.

Figure 1 shows a desk with gadgets, where a yellow box highlights one object to demonstrate object tracking.

TensorFlow makes it possible to use deep learning-based algorithms, which makes visual tracking easier. First, target objects for tracking are defined by labeled datasets of pictures or videos. TensorFlow provides a range of pre-trained models, such as SSD and Faster R-CNN, and its high-level APIs, such as Keras, enable the development of bespoke models. The prepared datasets are used to train these models, improving their capacity to forecast bounding boxes around tracked objects with precision. The trained models analyze fresh frames and produce predictions for object positions during inference. These predictions can be improved by post-processing methods like smoothing trajectories or removing false positives. Ultimately, the bounding boxes or tracked object coordinates are incorporated into larger systems for further examination or utilization.
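As a concrete illustration of this inference step, the short sketch below loads a publicly available pre-trained detector from TensorFlow Hub and returns filtered boxes for one frame. The model handle, the output dictionary keys, and the 0.5 confidence filter are assumptions chosen for demonstration and may differ from the configuration used in this work.

```python
# Minimal sketch: run a pre-trained TensorFlow detector on a single RGB frame.
# The TF Hub handle and output keys below are assumptions for illustration.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

detector = hub.load("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2")

def detect_frame(frame_rgb: np.ndarray, min_score: float = 0.5):
    """Return bounding boxes, scores and class ids for one uint8 RGB frame."""
    inputs = tf.convert_to_tensor(frame_rgb[np.newaxis, ...], dtype=tf.uint8)
    outputs = detector(inputs)
    boxes = outputs["detection_boxes"][0].numpy()      # [N, 4] normalized (ymin, xmin, ymax, xmax)
    scores = outputs["detection_scores"][0].numpy()    # [N]
    classes = outputs["detection_classes"][0].numpy()  # [N]
    keep = scores > min_score                          # simple confidence filter
    return boxes[keep], scores[keep], classes[keep]
```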

Figure 1. Visual object tracking [6]

2. Related Works

Li et al. [7] presented a network that uses the MA-Dual technique for object tracking and analyzes patterns across frames by using a spatial-temporal approach. Throughout the whole dataset, the approach extracts structural information using 3D convolutional processing. However, in certain datasets, issues like motion blur and low resolution make reliable object identification difficult. Differences in illumination across datasets may also cause variations in system performance. Comprehensive experiments using UAV123, OTB benchmarks, VOT, and TC128 datasets show the resilience of the approach. Findings show that tracking performance is promising, particularly when managing difficult circumstances including deformation, size variation, and lighting variations.

Zheng et al. [8] presented the Gaussian Process Regression Based Tracker (GPRT), a conventional tracking strategy that uses the part trick and removes boundary effects. The authors improved the performance of GPRT by introducing two effective updating strategies. Analyses performed on the OTB-2013 and OTB-2015 datasets demonstrate that GPRT outperforms trackers using hand-crafted features, with mean overlap accuracies of 84.1% and 79.2%, respectively. By utilizing Gaussian Process Regression in visual tracking, GPRT presents a unique tracking solution that does not require complex add-ons. GPRT not only fully removes boundary effects but also efficiently utilizes the part trick compared with all other CF trackers. The authors also present two unique and effective GPRT updating methods. Two benchmark datasets, OTB-2013 and OTB-2015, with more than 100 video sequences featuring a variety of challenges (birds, bolts, boxes, cars, bikers, blurred bodies, football, humans, Dudek, David, crowds, and more), are used for extensive testing.

Danelljan et al. [9] presented a probabilistic regression framework for tracking in which, given an input picture, the network predicts the conditional probability density of the target state. The system's architecture is particularly capable of handling noise that results from ambiguous annotations and unclear assignments. The Kullback-Leibler divergence is minimized in order to train the regression network. The framework not only enables a probabilistic representation of the output when used for tracking, but it also greatly enhances performance. On six datasets, the system's tracker achieves a new benchmark with an AUC of 59.8% on LaSOT and a Success rate of 75.8% on TrackingNet.

Chen and Tao [10] suggested a convolutional network-based regression technique for tracking moving objects. Edge regression is used by the system to extract textures and object patterns for tracking. Furthermore, it combines layered convolutional methods with a backpropagation model. To capture the object's integration features, each layer in the DCF model is customized or trained using a variety of viewpoints. The object tracking system addresses a range of difficulties and scales by means of repeated iterations and backpropagations. With just one convolutional layer, their technique offers a unique way to comfortably simulate regression in visual tracking.

Milan et al. [11] presented a system for multi-target tracking based on recurrent neural networks (RNNs), providing a novel solution to a number of issues this task presents. Deducing a changing number of targets over time, keeping a continuous state estimate for every target that is present, and resolving a discrete combinatorial problem are all necessary for tracking many objects in real scenarios. Earlier techniques frequently use intricate models that need time-consuming parameter adjustment. The authors provide an end-to-end learning architecture for online multi-target tracking, which differs from conventional methods. They point out that current deep learning techniques are not naturally equipped to address these issues and are not easily transferable to this task.

Yun et al. [12] presented a system built on a deep reinforcement learning algorithm and suggested a novel action-driven method for visual tracking using deep convolutional networks. The suggested tracker follows the target object iteratively through successive actions while being controlled by an ADNet. The computational complexity of tracking is greatly decreased by this action-driven tracking strategy. Furthermore, partially labeled data may be used with reinforcement learning, which can significantly improve training data generation with little effort. The evaluation findings show that the proposed tracker is three times faster than existing deep network-based trackers that use a tracking-by-location method, achieving state-of-the-art performance at 3 frames per second.

In a similar vein, other authors have put forward a Deep Reinforcement Learning based architecture. They provide a completely end-to-end technique for visual tracking in videos that determines where a target object's bounding box will be at every frame. A crucial realization is to treat tracking as a sequential, dynamic process in which past semantics contain very important information for the future. Using this intuition, they create a model that functions as a recurrent convolutional neural network interacting with the video over time. Long-term tracking performance can be improved by teaching the model tracking policies that concentrate on consecutive frame boxes through the use of Reinforcement Learning (RL) methods. Operating at increased frame rates, the suggested tracking algorithm achieves state-of-the-art performance on an existing tracking benchmark.

Henriques et al. [13] presented a method based on Kernelized Correlation Filters that showed how translated image patches can be represented in a methodical manner. They demonstrated that, under some circumstances, kernel matrices built from such shifted data become circulant, making it possible for the Discrete Fourier Transform (DFT) to diagonalize them and to quickly derive algorithms for handling translations. The authors developed cutting-edge trackers that operate at high frame rates and need little code to implement by applying this approach to linear and kernelized regression. Extensions of their fundamental methodology are probably advantageous in a number of additional problems. Circulant data has proven useful for a number of methods in detection and video event retrieval since this work's original publication.

Zheng et al. [14] presented an innovative tracking structure dubbed GPRT, which makes use of Gaussian Process Regression for visual tracking and is built on a robust Gaussian formulation. In contrast to existing CF trackers, GPRT concurrently uses part techniques and removes boundary effects. The authors demonstrated the efficacy of two unique, effective GPRT updating strategies they proposed. Extensive analyses were carried out on the OTB-2013 and OTB-2015 benchmark datasets, where GPRT surpassed all current trackers.

Li et al. [15] presented a dual-regression tracking architecture that consists of a correlation filter (CF) module and a discriminative convolutional neural network. By learning to distinguish between the target and background, this tracker becomes more discriminative and uses a fully convolutional network to reduce processing overhead. A CF module works on shallower layers with better spatial resolution to fine-tune the target position, since deeper layers of the fully convolutional network preserve less spatial detail. This two-stream approach, which uses a single forward CNN pass to estimate the target location, makes deep trackers more effective. The efficacy and efficiency of the suggested approach are demonstrated by evaluation on three publicly available datasets.

Zhang et al. [16] presented a Support Vector Regression (SVR)-based approach. Their article suggests a novel way to control uncalibrated visual servoing for 3D motion tracking. Motion in the image plane is first handled using PI control. The visual mapping model is then built using SVR. Finally, the continuous mapping approach is used to provide both planar and three-dimensional motion tracking. When compared to conventional BP neural network techniques for 3D motion visual tracking, SVR proved to have good approximation capabilities, especially when learning from small samples.

Zhang et al. [17] suggested a visual tracking method that takes rank loss, shrinkage loss, and optimal feature training into account. They matched target features using the template technique and adjusted their tracking accordingly. However, depending only on template-based techniques can yield less than ideal results, because these trackers are frequently restricted to particular targets and have difficulty handling a variety of datasets. In addition to highlighting the necessity of improving the model by optimizing filters to increase processing time, frame rate, and accuracy, the authors propose upgrading the template using preprocessing techniques. TempleColor128 is used for testing to assess the correctness of the system. Compared to traditional approaches, object tracking may be greatly aided by object detection since characteristics or patterns can be tracked more successfully.

Recent developments in Kernelized Correlation Filter (KCF)-based tracking have attempted to address issues such as limited feature representation, occlusion, and scale variation. Danelljan et al. [18] suggested an improved KCF algorithm that uses the MCMRV criterion to incorporate a multi-scale pyramid and an adaptive template update mechanism. This greatly improves the algorithm's robustness against occlusion and scale changes while maintaining a high processing speed appropriate for embedded platforms. Zheng et al. [19] improved tracking performance in dynamic scenes with frequent occlusions by introducing, in 2021, an occlusion-aware KCF tracker that incorporates RGB-D information. Yadav [20] further improved KCF in 2023 by combining deep features from VGG16, which strengthened the tracker's performance in visually complex and cluttered environments. Additionally, in 2021, Maharani et al. [21] addressed long-term tracking problems by implementing multi-scale detection and Lab color features, which decreased model drift and increased stability over long periods of time.

Improvements in methodology have also been made to Multiple Instance Learning (MIL)-based trackers in an effort to make them more resilient and flexible. Cheong and associates showed how different pooling strategies affect classification and tracking performance by comparing different MIL pooling filters such as max, mean, attention, and distribution-based approaches [22]. Oner et al. [23] combined MIL with features such as compressive tracking and histograms of oriented gradients (HOG) in a parallel tracking and detection framework, improving the tracker's dependability in challenging scenes. In addition, Xiong et al. [24] assessed MIL in a real-time tracking framework on a Raspberry Pi 3 Model B+ platform, demonstrating that although MIL occasionally provided good accuracy, it had trouble maintaining high frame rates on constrained hardware.

3. Proposed Work and Implementation

This research proposes a hybrid network based on Siamese and TensorFlow, designed for offline processing with large datasets. The network comprises multiple sub-networks for feature extraction, regression, and classification. By integrating the Siamese Network with TensorFlow, an object detection approach, the feature extraction model is enhanced for improved visual tracking analysis. Unlike previous recognition frameworks that repurpose classifiers or localizers for feature extraction, this model applies features across multiple areas of an image, even when objects are scaled. The system will undergo testing with various datasets and benchmarks, including OTB50, OTB100, and TempleColor128, aiming to achieve higher levels of accuracy.

3.1 Problem definition

Even though visual object tracking has advanced significantly with methods like deep learning, regression models, and reinforcement learning, a number of significant obstacles still exist. Among these are the challenges of preserving robustness in the face of unfavorable circumstances like motion blur, low resolution, fluctuating lighting, and object deformation. The real-world applicability of many current methods is limited by their struggles with boundary effects, occlusion, and generalization across diverse datasets. Furthermore, intricate designs are frequently unsuitable for real-time applications and necessitate extensive tuning. Effective mechanisms to manage long-term dependencies and to update models efficiently during tracking are also lacking. Balancing tracking accuracy and speed is still a major concern, which emphasizes the need for more flexible, lightweight, and dependable tracking systems that can function well in a variety of situations.

3.1.1 Method design

To precisely localize the target object in every frame, the suggested method for visual object tracking combines a lightweight regression framework with a feature extraction module based on Convolutional Neural Networks (CNNs). First, input frames undergo preprocessing, which includes normalization and resizing, and a region of interest is chosen for efficient computation. Selected CNN layers that strike a balance between semantic richness and spatial detail are used to extract deep features. After that, the target bounding box is predicted by a regression-based tracking module. To account for variations in appearance, scale, and occlusion, an adaptive update strategy uses confidence-based online learning to refine the model. A re-detection mechanism is incorporated to recover from tracking failures, and frame-wise predictions are smoothed to improve temporal consistency. Stochastic gradient descent is used to optimize the model after it has been trained using a combination of regression and similarity-based loss functions. Metrics like precision, AUC, and success rate are used to assess the model's robustness and real-time capability on benchmark datasets like OTB, VOT, and LaSOT.
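A minimal sketch of the preprocessing stage described above is given below; the crop interface and the output size are assumptions chosen only for illustration.

```python
# Illustrative preprocessing: crop a region of interest, resize it, and
# normalize pixel values to [0, 1]. The box format and output size are
# assumptions made for this sketch.
import tensorflow as tf

def preprocess_roi(frame, box, out_size=255):
    """`frame` is an [H, W, 3] uint8 array; `box` is (y1, x1, y2, x2) in pixels."""
    y1, x1, y2, x2 = box
    roi = frame[y1:y2, x1:x2, :]                      # region of interest
    roi = tf.image.resize(roi, (out_size, out_size))  # square network input
    return tf.cast(roi, tf.float32) / 255.0           # normalized input tensor
```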

3.1.2 Experimental validation

To ensure a thorough performance evaluation under a variety of real-world scenarios, the proposed visual object tracking system was experimentally validated on well-known benchmark datasets such as OTB-2013, OTB-2015, VOT, LaSOT, and UAV123. Standard metrics like accuracy, success rate, intersection over union (IoU), and area under the curve (AUC) were used to gauge the system's performance. To demonstrate gains in robustness, adaptability, and tracking accuracy, especially in the face of difficult circumstances like motion blur, occlusion, scale variation, and illumination changes, a comparative analysis was conducted against a number of cutting-edge trackers. Across sequences with different object classes, motion patterns, and scene complexities, the suggested method consistently outperformed baseline approaches, exhibiting superior tracking stability and efficiency while preserving real-time performance with little computational overhead.

3.2 Siamese network

Common applications of the Siamese neural network include object tracking and similarity learning. The fundamental design consists of two similar neural networks with the same topology and weights, often referred to as Siamese twins or twin networks. The Siamese neural network may be represented mathematically in the following way:

Let $f(x ; \theta)$ stand for the function learnt by the neural network, where $x$ is the input and $\theta$ denotes the network's parameters (weights and biases). In a Siamese Network, the model has two identical networks sharing the same parameters. Let the inputs to the first and second networks be represented by $x_1$ and $x_2$, respectively. The output of each network is a feature vector, denoted as $f\left(x_1 ; \theta\right)$ and $f\left(x_2 ; \theta\right)$, respectively. In other words, the following mathematical representation captures the fundamental structure of a Siamese neural network:

$f\left(x_1 ; \theta\right)=\operatorname{Net}\left(x_1 ; \theta\right)$              (1)

$f\left(x_2 ; \theta\right)=\operatorname{Net}\left(x_2 ; \theta\right)$              (2)

$S\left(x_1, x_2\right)=\operatorname{Similarity}\left(f\left(x_1 ; \theta\right), f\left(x_2 ; \theta\right)\right)$              (3)

where, $f\left(x_1 ; \theta\right)$ is the feature vector of first input with the network parameter $\theta$, similarly for the $f\left(x_2 ; \theta\right) . S\left(x_1, x_2\right)$ is the similarity between the first input vector and the second one.
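A compact Keras sketch of Eqs. (1)-(3) is shown below: a single shared embedding network is applied to both inputs, and a similarity score is computed between the two feature vectors. The small convolutional backbone and the cosine similarity used for $S(x_1, x_2)$ are assumptions for illustration, not the exact architecture trained in this work.

```python
# A minimal Siamese structure for Eqs. (1)-(3): one shared network Net(.; theta)
# embeds both inputs, then a similarity score compares the two feature vectors.
# The backbone and the cosine-similarity choice are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_embedding_net(input_shape=(127, 127, 3)):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(128)(x)                # feature vector f(x; theta)
    return Model(inputs, outputs, name="shared_net")

shared_net = build_embedding_net()                # one set of weights, Eqs. (1)-(2)

x1 = layers.Input(shape=(127, 127, 3), name="input_1")
x2 = layers.Input(shape=(127, 127, 3), name="input_2")
f1, f2 = shared_net(x1), shared_net(x2)           # both branches share parameters

# Cosine similarity as one possible choice for S(x1, x2) in Eq. (3).
similarity = layers.Dot(axes=1, normalize=True)([f1, f2])
siamese = Model([x1, x2], similarity)
```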

In practice, depending on the particular task and application, the network design, loss function, and optimization technique may change. Additionally, Siamese Networks are frequently trained for similarity learning tasks using methods like contrastive loss and triplet loss. In order to extract image characteristics, the Siamese Network uses a classification network, which biases the retrieved features toward semantic information. SiameseRPN++ uses ResNet as its feature extraction backbone network. The experimental findings confirm that different channels respond differently to distinct object categories, suggesting that deep features can capture semantic information associated with specific objects. The Siamese Network's dual-branch structure allows it to interpret input image data in different ways, producing characteristics that take various dimensions, such as channel and space, into account. By using the attention mechanism to filter visual data, the network is able to determine the object's significance during feature extraction, highlighting target features that are pertinent to the tracking job and ignoring background information. The most attractive feature of the Siamese Network is its offline learning approach: it trains the model in offline mode and tests it in the same manner. The template branch and the detection branch are the two branches that make up the Siamese tracker. While the detection branch examines the target image patch from the current frame, the template branch receives the target image patch from the previous frame as input. To track an object, it is first necessary to preprocess the image using smart filters and localize the target by marking it in the very first frame. Figure 2 illustrates target object initialization in object tracking, where the box marks the ball selected as the focus to follow in the video.

Figure 2. Target object initialization

Figure 3. Color mapping of target feature object

The process of extracting features begins once the target object has been established. Feature extraction here is an object-based method for classifying images, in which an object (or segment) is defined by a set of pixels with comparable spectral, spatial, and/or textural qualities. Figure 3 shows the color mapping that highlights the ball's features to initialize it as the tracking target.

On the other hand, conventional classification techniques are pixel-based, categorizing images based on the spectral information of individual pixels.

Figure 4. Task specific model training

Once feature extraction has been done, feature training starts; with the help of the Siamese Network, task-specific training is carried out to generate the output model. Figure 4 illustrates task-specific model training in object tracking.

3.3 TensorFlow

For object identification tasks, TensorFlow is a powerful tool with an extensive ecosystem of models and utilities. Choosing an appropriate model architecture, such as Faster R-CNN, SSD, or YOLO, each of which trades off speed against accuracy differently, is usually the first step in the process.

Convolutional neural networks (CNNs) are used by these designs to extract characteristics from input images and to forecast bounding boxes and class labels for the objects that are recognized. Using hierarchical features that capture semantic information about the objects in the image, CNNs process the input image. Convolutional and pooling layers, which gradually decrease the spatial dimensions while increasing the depth of feature maps, are commonly used to achieve this. Once features are retrieved, the model predicts bounding boxes, or coordinates, that closely surround objects of interest inside the picture. Figure 5 shows the main steps in TensorFlow object detection: extracting features, finding and classifying objects, then filtering results. Regressing the coordinates of a group of predetermined anchor boxes to better fit the position of the object is a common method for doing this. At the same time, the model classifies objects by giving each bounding box a class probability that indicates how likely it is to contain a certain object category. This is often accomplished using extra convolutional layers and a softmax activation function to compute class probabilities. Non-Maximum Suppression is used to filter out overlapping bounding boxes with lower confidence ratings in order to get rid of redundant detections. This guarantees the retention of only the most confident detections. Three phases are usually included in the object detection operation:

(a) The input is divided into manageable chunks. Figure 6 demonstrates image segmentation: the grid divides the cat photo into distinct regions, helping separate and analyze parts of the image for tasks like object recognition. The whole image is covered by the extensive collection of bounding boxes, as shown.

(b) For each segmented rectangular area, feature extraction is conducted to determine whether the rectangle contains a valid object. Figure 7 presents each box as a region from which features are extracted for later recognition or tracking. Figure 8 depicts Non-Maximum Suppression, which merges overlapping detection boxes into a single rectangle around the cat.

(c) Non-Maximum Suppression creates a single boundary rectangle by joining overlapping boxes, as sketched below.
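The following short sketch applies TensorFlow's built-in Non-Maximum Suppression operator to a set of candidate boxes; the maximum box count and IoU threshold shown here are illustrative values.

```python
# Sketch of the Non-Maximum Suppression step using TensorFlow's built-in op.
# The thresholds are illustrative, not necessarily the values used in this work.
import tensorflow as tf

def suppress_overlaps(boxes, scores, max_boxes=10, iou_threshold=0.4):
    """Keep only the highest-scoring box among heavily overlapping detections.
    `boxes` is a [N, 4] float tensor in (y1, x1, y2, x2) form, `scores` is [N]."""
    keep = tf.image.non_max_suppression(
        boxes, scores, max_output_size=max_boxes, iou_threshold=iou_threshold)
    return tf.gather(boxes, keep), tf.gather(scores, keep)
```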

Figure 5. Visual object detection in TensorFlow

Figure 6. Segmentation of an image

Figure 7. Feature extraction of all boxes

Figure 8. Object detection using non-max suppression

3.4 Architecture of Siamese Network

Siamese Networks are a subclass of networks defined by the use of two identical sub-networks for one-shot classification tasks. Even though these sub-networks analyze distinct inputs, they retain the same designs, parameters, and weights. Siamese Networks learn a similarity function, in contrast to typical CNNs that are trained on large datasets to predict various classes. They can distinguish between classes with less data thanks to this property, which makes them quite effective for one-shot classification tasks. These networks are frequently able to categorize pictures effectively with only one sample thanks to this remarkable feature. Instead of using a large amount of labeled data for training, as in the standard approach, few-shot learning trains models to make predictions using only a limited number of instances. When it becomes difficult or expensive to gather large amounts of labeled data, the value of few-shot learning becomes clear. Few-shot models may produce predictions with little input, sometimes even from a single example, because their design captures the intricacies present in a tiny sample. This capability is made possible by several design approaches, including Siamese Networks, meta-learning, and related techniques. With the help of these frameworks, the model is able to derive insightful data representations and apply them successfully to new, untested samples.

The goal of the differencing layer is to draw attention to the differences between dissimilar pairs while highlighting the commonalities between similar inputs. The Euclidean distance function is employed to do this:

Distance $\left(x_1, x_2\right)=\left\|\mathrm{f}\left(x_1\right)-\mathrm{f}\left(x_2\right)\right\|_2$          (4)

Here, $x_1$ and $x_2$ represent the two inputs, $f\left(x_1\right)$ and $f\left(x_2\right)$ denote the outputs of the encoding function, and Distance represents the distance function. The cross-entropy loss can be written as:

$-(y \log (p)+(1-y) \log (1-p))$               (5)

where,

y represents the true label,

p represents the predicted probability
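A hedged sketch of the differencing layer in Eq. (4) and the cross-entropy loss in Eq. (5) is shown below; the sigmoid mapping from distance to a probability p is a common convention and is an assumption here rather than the exact formulation used in the paper.

```python
# Sketch of Eq. (4) (Euclidean distance between embeddings) and Eq. (5)
# (binary cross-entropy on a similarity probability). Mapping the distance
# to a probability with a sigmoid is an assumption made for this sketch.
import tensorflow as tf

def euclidean_distance(f1, f2):
    """Eq. (4): L2 distance between the two feature vectors."""
    return tf.norm(f1 - f2, axis=-1)

def pair_loss(y_true, f1, f2):
    """Eq. (5): -(y log p + (1 - y) log(1 - p)), with p derived from the distance."""
    d = euclidean_distance(f1, f2)
    p = tf.sigmoid(-d)                        # small distance -> high "same object" probability
    p = tf.clip_by_value(p, 1e-7, 1.0 - 1e-7)
    return -(y_true * tf.math.log(p) + (1.0 - y_true) * tf.math.log(1.0 - p))
```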

The normalized cross-correlation between the search region and the template, which gives the matching score at position $(i, j)$, is computed as:

$K(i, j)=\frac{\sum_{x=1}^N \sum_{y=1}^N S(x, y) \times T(x, y)}{\sqrt{\sum_{x=1}^N \sum_{y=1}^N[S(x, y)]^2} \sqrt{\sum_{x=1}^N \sum_{y=1}^N[T(x, y)]^2}}$               (6)

where,

T(x,y) denotes the template image,

S(x,y) represents the search region of the target, and

N indicates the height and width of the data, and K(i, j) is the resulting normalized correlation score.

The response map is obtained by using a fully-convolutional Siamese Network, as shown in the flow chart of the Siamese framework. The re-detection network is initiated if the secondary peak in this map is more than 0.75 times the size of the main peak.
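The re-detection trigger described above can be sketched as follows: the main peak of the response map is located, a small neighbourhood around it is suppressed, and re-detection is initiated when the remaining maximum exceeds 0.75 times the main peak. The suppression radius is an assumed illustrative value.

```python
# Sketch of the secondary-peak check on the Siamese response map: trigger
# re-detection when the second peak exceeds 0.75x the main peak. The
# suppression radius is an assumption for illustration.
import numpy as np

def needs_redetection(response_map: np.ndarray, ratio=0.75, radius=8) -> bool:
    r = response_map.astype(np.float64).copy()
    y, x = np.unravel_index(np.argmax(r), r.shape)
    main_peak = r[y, x]
    # Mask the neighbourhood of the main peak so the next maximum is a
    # genuinely distinct secondary peak.
    r[max(0, y - radius): y + radius + 1, max(0, x - radius): x + radius + 1] = -np.inf
    secondary_peak = r.max()
    return bool(secondary_peak > ratio * main_peak)
```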

3.5 Process model

To maximize input quality, the system first gathers frames for preprocessing. The networks are obtained and loaded to recognize objects based just on their appearance once all preprocessing activities have been completed. Figure 9 shows a Siamese network architecture for object detection, where parallel networks compare a template and a detection frame to localize and track the target object.

Figure 9. Object detection Siamese Network architecture

After that, an additional network is utilized for effective object tracking. The system makes use of two distinct strategies that are well-known for their effectiveness in tracking and object identification, which helps to improve the regression model. This minimizes overlap counts and increases the frame rate per second while maintaining excellent system efficiency. Figure 10 presents the proposed flow chart.

Figure 10. Proposed flow chart

Algorithm 1: Siamese & TensorFlow Algorithm

Initialization

Input: Frames

Output: (X, Y) Coordinates

Step 1. Initialize:

   - Input the first frame of the dataset.

Step 2. Convert:

   - Convert the RGB image to grayscale.

Step 3. Setup:

   - Define the target area or object according to the dataset.

Step 4. Frame Sequences:

   - Establish the frame sequence $\{X_t\}_{t=1}^{T}$, where $X_1$ represents the initial frame.

Step 5. Model Loading:

   - Load the Siamese and TensorFlow models.

Step 6. Iterate:

   - For each iteration (I) from 1 to N:

     - Search for the target object in the frames.

     - If the detection confidence exceeds the threshold value:

       - Increment the count for overlapping detections.

     - Else:

       - Increment the count for accurately regressed objects.

Step 7. Extraction:

   - Extract coordinates: Cx (width), Cy (height), X, and Y (coordinates of the target object).

Step 8. Comparison:

   - Compare extracted coordinates with ground truth.

Step 9. Accuracy Calculation:

   - Compute system accuracy using counts of loss and true regression.

Step 10. End:

    - Terminate the algorithm
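A compact Python sketch of the main loop in Algorithm 1 is given below. The `track_model` callable stands in for the combined Siamese and TensorFlow detector, and the confidence threshold is a placeholder; both are assumptions for illustration.

```python
# Sketch of Algorithm 1's main loop. `track_model` is a placeholder for the
# combined Siamese/TensorFlow detector; the threshold is an assumed value.
import cv2  # OpenCV is assumed available for reading and converting frames

def run_tracking(frame_paths, track_model, conf_threshold=0.5):
    overlap_count, regressed_count = 0, 0
    predictions = []
    for path in frame_paths:
        frame = cv2.imread(path)                          # Step 1: read the frame
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)    # Step 2: convert to grayscale
        box, confidence = track_model(gray)               # Step 6: search for the target
        if confidence > conf_threshold:
            overlap_count += 1                            # overlapping detection
        else:
            regressed_count += 1                          # accurately regressed object
        predictions.append(box)                           # Step 7: (Cx, Cy, X, Y) coordinates
    # Steps 8-9: the predictions are later compared against the ground truth
    # to compute accuracy from the loss and true-regression counts.
    return predictions, overlap_count, regressed_count
```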

First, frames from a dataset are read and converted from RGB to grayscale using the Siamese Network and TensorFlow technique. After that, the algorithm creates frame sequences and specifies the target region or object in accordance with the dataset criteria. To start the detection procedure, the algorithm loads the TensorFlow and Siamese Network models. It looks for the target object inside the frames during each cycle. It increments the count for overlapping detections if the detection confidence is greater than a predetermined threshold, and otherwise counts the object as accurately regressed. The target object's width, height, and (x, y) coordinates are extracted by the method. It then evaluates accuracy by contrasting these extracted coordinates with the ground truth. Lastly, it uses counts of loss and true regression to calculate the accuracy of the system before terminating. To guarantee efficient object tracking, the Siamese Network and TensorFlow hybrid tracking model was implemented with specific training setups and parameters. To filter and improve detection results, Non-Maximum Suppression (NMS) was applied with a confidence threshold of 0.5 and an IoU threshold of 0.4. The Adam optimizer was used to optimize the contrastive loss function over 50 epochs with a batch size of 32, and the learning rate was set at 1e-4. A modified AlexNet was used as the feature embedding backbone, and the input images were standardized to 127×127 for the exemplar (template) and 255×255 for the search region. The algorithmic flow was made more understandable by adding a pseudocode representation that details the steps involved, from feature extraction and similarity calculation to score thresholding and final bounding box selection. All of these specifics improve the transparency and reproducibility of the suggested method.
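The training settings reported above can be summarized in the following sketch (Adam optimizer, learning rate 1e-4, batch size 32, 50 epochs). The contrastive-loss margin, the shape of the training pairs, and the `embedding_net` argument are assumptions; the earlier embedding sketch, or any other backbone, could be passed in.

```python
# Training-configuration sketch matching the reported settings: Adam with a
# learning rate of 1e-4, batch size 32, and 50 epochs. The contrastive-loss
# margin and the dataset format are assumptions for illustration.
import tensorflow as tf

EPOCHS, BATCH_SIZE = 50, 32
MARGIN = 1.0  # assumed contrastive-loss margin
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def contrastive_loss(y_true, f1, f2, margin=MARGIN):
    """Pull matching pairs together, push non-matching pairs at least `margin` apart."""
    d = tf.norm(f1 - f2, axis=-1)
    return tf.reduce_mean(y_true * tf.square(d) +
                          (1.0 - y_true) * tf.square(tf.maximum(margin - d, 0.0)))

def train(embedding_net, train_dataset):
    """`train_dataset` is assumed to yield ((exemplar, search), label) pairs of
    equally sized crops; `embedding_net` is the shared Siamese branch."""
    for _ in range(EPOCHS):
        for (exemplar, search), label in train_dataset.batch(BATCH_SIZE):
            with tf.GradientTape() as tape:
                loss = contrastive_loss(tf.cast(label, tf.float32),
                                        embedding_net(exemplar, training=True),
                                        embedding_net(search, training=True))
            grads = tape.gradient(loss, embedding_net.trainable_variables)
            optimizer.apply_gradients(zip(grads, embedding_net.trainable_variables))
```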

4. Results

TensorFlow version 2.12 and Python 3.10 were used for the implementation, which was run on a hardware configuration that included an Intel Core i7 processor, 16 GB of system RAM, and an NVIDIA RTX 3060 GPU with 10 GB of VRAM. Effective model evaluation and training were guaranteed by this setup. To support reproducibility, the model was trained with a learning rate of 1e-4, a batch size of 32, and a total of 50 epochs. The training parameters and other hardware and software specifics are documented for clarity and to enable others to precisely reproduce the experimental conditions. The suggested tracker is assessed using both qualitative and quantitative methods. Several tests are performed on a collection of twenty-three chosen videos from three well-known datasets: OTB-50, OTB-100, and Temple Colour-128. A variety of visual difficulties may be seen in these videos, such as clutter, out-of-plane rotation, occlusion, scale fluctuations, deformation, rapid motion, and motion blur. For experimental validation, the OTB-50, OTB-100, and TC128 datasets were chosen based on their usage in the base paper, guaranteeing consistency and comparability of findings. These datasets are well-known within the visual tracking community and offer a variety of tracking scenarios that are crucial for assessing the robustness and performance of tracking algorithms.

OTB-50 and OTB-100 are well suited to evaluating short-term tracking abilities under a variety of difficult circumstances because they provide sequences with annotated attributes like occlusion, illumination variation, scale changes, and fast motion. By adding increasingly intricate and cluttered environments, TC128 expands on this assessment and puts the algorithm's flexibility and accuracy to the test. We guarantee a fair and straightforward performance comparison by utilizing the same datasets as the base study, confirming the efficacy of the suggested approach under comparable experimental conditions. The coordinates from the suggested system are used to calculate the results, which are then compared to the ground-truth values. Three benchmarks, OTB50, OTB100, and TempleColor128, each including several videos with thousands of frames, are used to assess the suggested method. Specific problems that align with the benchmark attributes are presented in each video. Eleven characteristics are commonly employed to evaluate the correctness of the system, such as overlap and accurate object detection at every frame. If the bounding boxes fall beyond the window or surpass a threshold value, an error occurs that reduces accuracy. Frames per second (FPS) has been measured to obtain the execution speed. Success and precision are the two parameters through which the results have been computed.

Success is evaluated on the basis of how the current bounding box overlaps with the ground-truth bounding box, measured by the Intersection over Union (IoU) between the predicted outcome and the ground truth.

$ Success =\frac{Area \, of \, Overlap}{Area \, of \, Union}$

The IoU is counted when it exceeds the threshold value, i.e., 20 pixels. Precision, in contrast, is calculated on the basis of the center location of the predicted bounding box relative to that of the ground truth; the only difference is the threshold, where 50 pixels are considered for precision.

$Precision=\frac{Number \, of \, frames<threshold}{Total \, number \, of \, frames}$

If IoU ≥ 0.5, then 0.5 is considered the ground-truth threshold. Each video contains a total number of frames, and if the bounding box fails to track the target object, the frames in which the target is lost are counted as loss frames.

$Accuracy=\frac{Total \, Number \, of \, Frames-Total \, Predicted \, Loss \, Frames}{Total \, Number \, of \, Frames} \times 100 \%$

$ Loss Count =100- Predicted \, Accuracy \%$
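The evaluation quantities above can be computed as in the following sketch; the box format and the default thresholds (IoU of 0.5 for success and a pixel threshold for the center-location precision) are assumptions made for illustration.

```python
# Sketch of the evaluation metrics defined above: IoU-based success,
# center-location precision, and the frame-level accuracy / loss count.
# Boxes are (x, y, w, h); the default thresholds are assumed values.
import numpy as np

def iou(box_a, box_b):
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    iw = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
    ih = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def center_error(box_a, box_b):
    ca = np.array([box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0])
    cb = np.array([box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0])
    return float(np.linalg.norm(ca - cb))

def evaluate(pred_boxes, gt_boxes, iou_thr=0.5, center_thr=20.0):
    ious = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    errors = [center_error(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    success = float(np.mean([v >= iou_thr for v in ious]))
    precision = float(np.mean([e <= center_thr for e in errors]))
    loss_frames = sum(v < iou_thr for v in ious)          # frames where the target is lost
    accuracy = (len(gt_boxes) - loss_frames) / len(gt_boxes) * 100.0
    loss_count = 100.0 - accuracy
    return success, precision, accuracy, loss_count
```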

Figure 11 shows sample frames from tracking datasets (OTB-50, OTB-100, Temple Colour-128), with colored boxes marking tracked objects in each scene.

Figure 11. Selected videos from OTB-50, OTB-100, and Temple Colour-128 datasets

Tables 1-6 present the recorded performance metrics across three datasets. Tables 1 and 2 show the accuracy and loss count for OTB-50, respectively. Tables 3 and 4 provide the corresponding metrics for OTB-100, while Tables 5 and 6 detail the accuracy and loss count for TC-128.

Table 1. Recorded accuracy of OTB-50

| Datasets | Accuracy | FPS | Datasets | Accuracy | FPS |
|---|---|---|---|---|---|
| Basketball | 100.00 | 285.0 | Human3 | 100.00 | 127.1 |
| Biker | 100.00 | 280.5 | Human4 | 54.52 | 295.5 |
| Bird1 | 46.43 | 271.1 | Human6 | 93.15 | 243.6 |
| BlurBody | 99.40 | 206.9 | Human9 | 100.00 | 300.4 |
| BlurCar2 | 85.62 | 248.9 | Ironman | 80.77 | 286.0 |
| BlurFace | 64.42 | 233.0 | Jump | 96.69 | 275.5 |
| BlurOwl | 98.57 | 293.0 | Jumping | 100.00 | 301.7 |
| Bolt | 100.00 | 299.3 | Liqour | 59.95 | 184.1 |
| Box | 99.05 | 172.3 | Matrix | 89.80 | 262.1 |
| Car1 | 100.00 | 235.8 | MotorRollig | 95.00 | 272.3 |
| Car4 | 100.00 | 290.4 | Panda | 100.00 | 239.9 |
| CarDark | 100.00 | 304.7 | RedTeam | 100.00 | 193.4 |
| CarScale | 96.80 | 269.4 | Shaking | 100.00 | 294.9 |
| ClifBar | 96.40 | 297.8 | Singer2 | 81.06 | 298.7 |
| Couple | 100.00 | 278.8 | Skating1 | 89.13 | 290.6 |
| Crowds | 100.00 | 291.1 | Skating2 | 72.13 | 269.3 |
| David | 81.09 | 263.0 | Skiing | 100.00 | 263.0 |
| Deer | 100.00 | 237.9 | Soccer | 92.59 | 260.0 |
| Diving | 98.60 | 284.2 | Surfur | 100.00 | 300.8 |
| DragonBaby | 89.19 | 270.3 | Sylvester | 100.00 | 206.7 |
| Dudek | 99.56 | 169.8 | Tiger2 | 87.33 | 262.3 |
| Football | 55.15 | 248.7 | Trellis | 100.00 | 294.7 |
| Freeman4 | 94.31 | 301.8 | Walking | 100.00 | 289.7 |
| Girl | 100.00 | 302.3 | Walking2 | 100.00 | 302.9 |
| | | | Woman | 97.65 | 301.7 |
| Mean | 91.72 | | | | |

Table 2. Recorded loss count of OTB-50

| Datasets | Overlap | Datasets | Overlap |
|---|---|---|---|
| Basketball | 0.00 | Human3 | 0.00 |
| Biker | 0.00 | Human4 | 45.48 |
| Bird1 | 53.57 | Human6 | 6.85 |
| BlurBody | 0.60 | Human9 | 0.00 |
| BlurCar2 | 14.38 | Ironman | 19.23 |
| BlurFace | 35.58 | Jump | 3.31 |
| BlurOwl | 1.42 | Jumping | 0.00 |
| Bolt | 0.00 | Liqour | 40.05 |
| Box | 0.95 | Matrix | 10.20 |
| Car1 | 0.00 | MotorRolling | 5.00 |
| Car4 | 0.00 | Panda | 0.00 |
| CarDark | 0.00 | RedTeam | 0.00 |
| CarScale | 3.20 | Shaking | 0.00 |
| ClifBar | 3.61 | Singer2 | 18.94 |
| Couple | 0.00 | Skating1 | 10.87 |
| Crowds | 0.00 | Skating2 | 27.87 |
| David | 18.91 | Skiing | 0.00 |
| Deer | 0.00 | Soccer | 7.41 |
| Diving | 1.40 | Surfur | 0.00 |
| DragonBaby | 10.81 | Sylvester | 0.00 |
| Dudek | 0.44 | Tiger2 | 12.67 |
| Football | 44.85 | Trellis | 0.00 |
| Freeman4 | 5.69 | Walking | 0.00 |
| Girl | 0.00 | Walking2 | 0.00 |
| | | Woman | 2.35 |
| Mean | 8.28 | | |

Table 3. Recorded accuracy of OTB-100

| Datasets | Accuracy | FPS | Datasets | Accuracy | FPS |
|---|---|---|---|---|---|
| Bird2 | 100.00 | 267.1 | Freeman1 | 100.00 | 294.9 |
| BlurCar1 | 99.05 | 270.0 | Freeman3 | 100.00 | 296.5 |
| BlurCar3 | 100.00 | 294.0 | Girl2 | 46.30 | 194.2 |
| BlurCar4 | 93.62 | 239.3 | Gym | 100.00 | 256.1 |
| Board | 60.17 | 223.2 | Human2 | 94.67 | 125.5 |
| Bolt2 | 69.10 | 302.6 | Human5 | 100.00 | 280.2 |
| Boy | 100.00 | 303.3 | Human7 | 100.00 | 293.7 |
| Car2 | 100.00 | 253.2 | Human8 | 100.00 | 264.7 |
| Car24 | 100.00 | 174.5 | Jogging | 100.00 | 283.0 |
| Coke | 93.75 | 285.2 | KiteSurf | 100.00 | 271.3 |
| Coupon | 41.25 | 291.0 | Lemming | 93.54 | 149.4 |
| Crossing | 100.00 | 286.0 | Man | 100.00 | 291.4 |
| Dancer | 100.00 | 265.0 | Mhyang | 100.00 | 196.9 |
| Dancer2 | 100.00 | 273.5 | MountainBike | 100.00 | 290.5 |
| David2 | 100.00 | 303.8 | Rubik | 97.59 | 117.8 |
| David3 | 98.79 | 287.1 | Singer1 | 100.00 | 252.1 |
| Dog | 87.50 | 275.3 | Skater | 100.00 | 264.0 |
| Dog1 | 41.02 | 162.0 | Skater2 | 96.06 | 270.6 |
| Doll | 97.60 | 164.8 | Subway | 100.00 | 296.9 |
| FaceOcc1 | 41.83 | 239.8 | Suv | 100.00 | 289.7 |
| FaceOcc2 | 76.72 | 260.8 | Tiger1 | 99.43 | 255.6 |
| Fish | 100.00 | 284.6 | Toy | 98.51 | 285.2 |
| Fleetface | 66.10 | 217.7 | Trans | 50.00 | 191.9 |
| Football1 | 100.00 | 267.6 | Twinnings | 99.79 | 295.1 |
| | | | Vase | 88.76 | 270.4 |
| Mean | 90.43 | | | | |

Table 4. Recorded loss count of OTB-100

| Datasets | Overlap | Datasets | Overlap |
|---|---|---|---|
| Bird2 | 0.00 | Freeman1 | 0.00 |
| BlurCar1 | 0.948 | Freeman3 | 0.00 |
| BlurCar3 | 0.00 | Girl2 | 53.69 |
| BlurCar4 | 6.38 | Gym | 0.00 |
| Board | 39.82 | Human2 | 5.33 |
| Bolt2 | 30.90 | Human5 | 0.00 |
| Boy | 0.00 | Human7 | 0.00 |
| Car2 | 0.00 | Human8 | 0.00 |
| Car24 | 0.00 | Jogging | 0.00 |
| Coke | 6.25 | KiteSurf | 0.00 |
| Coupon | 58.75 | Lemming | 6.45 |
| Crossing | 0.00 | Man | 0.00 |
| Dancer | 0.00 | Mhyang | 0.00 |
| Dancer2 | 0.00 | MountainBike | 0.00 |
| David2 | 0.00 | Rubik | 2.41 |
| David3 | 1.21 | Singer1 | 0.00 |
| Dog | 12.50 | Skater | 0.00 |
| Dog1 | 58.98 | Skater2 | 3.94 |
| Doll | 2.40 | Subway | 0.00 |
| FaceOcc1 | 58.17 | Suv | 0.00 |
| FaceOcc2 | 23.28 | Tiger1 | 0.57 |
| Fish | 0.00 | Toy | 1.49 |
| Fleetface | 33.90 | Trans | 50.00 |
| Football1 | 0.00 | Twinnings | 0.21 |
| | | Vase | 11.24 |
| Mean | 9.57 | | |

Table 5. Recorded accuracy of TC-128

| Datasets | Accuracy | FPS | Datasets | Accuracy | FPS |
|---|---|---|---|---|---|
| Airport_ce | 62.16216 | 121.7 | Kite_ce3 | 100 | 302.8 |
| Baby_ce_gt | 100 | 294.4 | Kobe_ce | 100 | 258.4 |
| Badminton_ce1 | 100 | 304.3 | Lemming | 100 | 141.1 |
| Badminton_ce2 | 100 | 251.8 | Liquor | 100 | 180.9 |
| Ball_ce1 | 71.86701 | 261.2 | Logo_ce | 100 | 270.6 |
| Ball_ce2 | 100 | 304.0 | Matrix | 74.19355 | 274.7 |
| Ball_ce3 | 99.6337 | 252.8 | Messi_ce | 58.00 | 298.5 |
| Ball_ce4 | 81.07527 | 304.6 | Michaeljackson_ce | 100 | 127.5 |
| Basketball | 100 | 268.4 | Microphone_ce1 | 72.51908 | 284.9 |
| Basketball_ce1 | 100 | 300.9 | Microphone_ce2 | 100 | 288.8 |
| Basketball_ce2 | 98.24561 | 299.2 | Motorbike_ce | 100 | 283.8 |
| Basketball_ce3 | 89.14027 | 290.5 | MotorRolling | 100 | 269.6 |
| Bee_ce | 73.68421 | 270.4 | MountainBike | 71.95122 | 298.2 |
| Bicycle | 100 | 296.2 | Panda | 100 | 298.4 |
| Bike_ce1 | 100 | 271.6 | Plane_ce2 | 84.6473 | 301.8 |
| Bike_ce2 | 100 | 268.0 | Plate_ce1 | 87.09677 | 279.5 |
| Biker | 66.85083 | 259.4 | Plate_ce2 | 100 | 286.1 |
| Bikeshow_ce | 94.5544 | 286.0 | Pool_ce1 | 100 | 300.9 |
| Bird | 91.43646 | 258.1 | Pool_ce2 | 100 | 294.5 |
| Board | 88.00 | 242.9 | Pool_ce3 | 100 | 290.9 |
| Boat_ce1 | 78.70968 | 290.2 | Railwaystation_ce | 14.51613 | 232.4 |
| Boat_ce2 | 94.44444 | 297.6 | Ring_ce | 71.91283 | 298.8 |
| Bolt | 100 | 301.2 | Sailor_ce | 100 | 301.6 |
| Boy | 69.51567 | 305.0 | Shaking | 81.8408 | 295.4 |
| Busstation_ce1 | 100 | 288.1 | Singer_ce1 | 100 | 287.6 |
| Busstation_ce2 | 95.6044 | 294.2 | Singer_ce2 | 83.64486 | 171.1 |
| CarDark | 100 | 309.9 | Singer1 | 90.32258 | 276.9 |
| CarScale | 100 | 287.7 | Singer2 | 85.75499 | 300.3 |
| Carchasing_ce1 | 100 | 236.4 | Skating_ce1 | 73.49727 | 203.2 |
| Carchasing_ce3 | 100 | 313.5 | Skating_ce2 | 70.66015 | 95.70 |
| Carchasing_ce4 | 100 | 301.8 | Skating1 | 74.19355 | 292.7 |
| Charger_ce | 100 | 91.3 | Skating2 | 90.75 | 273.5 |
| Coke | 69.46309 | 292.6 | Skiing | 93.11828 | 274.1 |
| Couple | 100 | 286.4 | Skiing_ce | 100 | 168.3 |
| Crossing | 98.57143 | 278.1 | Skyjumping_ce | 80.43011 | 250.7 |
| Cup | 100 | 295.6 | Soccer | 100 | 284.3 |
| Cup_ce | 100 | 290.1 | Spiderman_ce | 83.92857 | 227.0 |
| David | 71.30178 | 272.8 | Subway | 74.43182 | 284.3 |
| David3 | 74.19355 | 293.9 | Suitcase_ce | 61.71429 | 247.9 |
| Deer | 100 | 227.4 | Sunshade | 81.52174 | 303.7 |
| Diving | 84.50704 | 279.3 | SuperMario_ce | 98.83721 | 253.3 |
| Doll | 93.93939 | 151.5 | Surf_ce1 | 92.44186 | 204.6 |
| Eagle_ce | 100 | 290.7 | Surf_ce2 | 94.05941 | 119.6 |
| Electricalbike_ce | 50.00 | 232.3 | Surf_ce3 | 78.26087 | 246.9 |
| Face_ce | 100 | 139 | Surf_ce4 | 69.17563 | 215.1 |
| Face_ce | 75.48387 | 224.1 | TableTennis_ce | 74.81481 | 308.5 |
| Fish_ce1 | 35.13514 | 225.2 | Tennis_ce1 | 89.39394 | 252.3 |
| Fish_ce2 | 95.05376 | 286.7 | Tennis_ce2 | 76.65198 | 238.8 |
| FaceOcc1 | 100 | 276.5 | Tennis_ce3 | 97.70492 | 243.8 |
| Football1 | 98.27957 | 270.1 | TennisBall_ce | 83.82353 | 291.0 |
| Girl | 91.35802 | 299.3 | Thunder_ce | 95.13889 | 296.2 |
| Girlmov | 100 | 108 | Tiger1 | 100 | 242.3 |
| Guitar_ce1 | 93.97849 | 284.2 | Tiger2 | 97.45763 | 280.8 |
| Guitar_ce2 | 100 | 230.1 | Torus | 98.63014 | 297.3 |
| Gym | 79.8722 | 275.8 | Toyplane_ce | 100 | 252.7 |
| Hand | 100 | 275.4 | Trellis | 78.2716 | 305.2 |
| Hand_ce1 | 69.67213 | 203.9 | Walking | 97.2043 | 295.6 |
| Hand_ce2 | 99.25187 | 286.6 | Walking2 | 100 | 293.0 |
| Hurdle_ce1 | 100 | 237.7 | Woman | 96.12903 | 310.7 |
| Hurdle_ce2 | 60.00 | 293.8 | Yo-yos_ce1 | 100 | 293.4 |
| Iceskater | 100 | 261.5 | Yo-yos_ce2 | 88.93617 | 286.1 |
| Ironman | 100 | 284.3 | Yo-yos_ce3 | 82.15859 | 285.7 |
| Jogging [1] | 96.38554 | 292.5 | | | |
| Jogging [2] | 75.57003 | 286.6 | | | |
| Juice | 74.59283 | 301.9 | | | |
| Kite_ce1 | 100 | 307.2 | | | |
| Kite_ce2 | 100 | 309.7 | | | |
| Mean | 89.07961 | | | | |

Table 6. Recorded loss count of TC-128

| Datasets | Overlap | Datasets | Overlap |
|---|---|---|---|
| Airport_ce | 37.83784 | Kite_ce3 | 0 |
| Baby_ce_gt | 0 | Kobe_ce | 0 |
| Badminton_ce1 | 0 | Lemming | 0 |
| Badminton_ce2 | 0 | Liquor | 0 |
| Ball_ce1 | 28.13299 | Logo_ce | 25.80645 |
| Ball_ce2 | 0 | Matrix | 42 |
| Ball_ce3 | 0.3663 | Messi_ce | 0 |
| Ball_ce4 | 18.92473 | Michaeljackson_ce | 27.48092 |
| Basketball | 0 | Microphone_ce1 | 0 |
| Basketball_ce1 | 0 | Microphone_ce2 | 0 |
| Basketball_ce2 | 1.75439 | Motorbike_ce | 0 |
| Basketball_ce3 | 10.85973 | MotorRolling | 28.04878 |
| Bee_ce | 26.31579 | MountainBike | 0 |
| Bicycle | 0 | Panda | 15.3527 |
| Bike_ce1 | 0 | Plane_ce2 | 12.90323 |
| Bike_ce2 | 0 | Plate_ce1 | 0 |
| Biker | 33.14917 | Plate_ce2 | 0 |
| Board | 12 | Pool_ce1 | 0 |
| Boat_ce1 | 21.29032 | Pool_ce2 | 85.48387 |
| Boat_ce2 | 5.55556 | Pool_ce3 | 28.08717 |
| Bolt | 0 | Railwaystation_ce | 0 |
| Boy | 30.48433 | Ring_ce | 18.1592 |
| Busstation_ce1 | 0 | Sailor_ce | 0 |
| Busstation_ce2 | 4.3956 | Shaking | 16.35514 |
| CarDark | 0 | Singer_ce1 | 9.67742 |
| CarScale | 0 | Singer_ce2 | 14.24501 |
| Carchasing_ce1 | 0 | Singer1 | 26.50273 |
| Carchasing_ce3 | 0 | Singer2 | 29.33985 |
| Carchasing_ce4 | 0 | Skating_ce1 | 25.80645 |
| Charger_ce | 0 | Skating_ce2 | 9.25 |
| Coke | 30.53691 | Skating1 | 6.88172 |
| Couple | 0 | Skating2 | 0 |
| Crossing | 1.42857 | Skiing | 19.56989 |
| Cup | 0 | Skiing_ce | 0 |
| Cup_ce | 0 | Skyjumping_ce | 16.07143 |
| David | 28.69822 | Soccer | 25.56818 |
| David3 | 25.80645 | Spiderman_ce | 38.28571 |
| Deer | 0 | Subway | 18.47826 |
| Diving | 15.49296 | Suitcase_ce | 1.16279 |
| Doll | 6.06061 | Sunshade | 7.55814 |
| Eagle_ce | 0 | SuperMario_ce | 5.94059 |
| Electricalbike_ce | 50 | Surf_ce1 | 21.73913 |
| Face_ce | 0 | Surf_ce2 | 30.82437 |
| Face_ce2 | 24.51613 | Surf_ce3 | 25.18519 |
| FaceOcc1 | 64.86486 | Surf_ce4 | 10.60606 |
| Fish_ce1 | 4.94624 | TableTennis_ce | 23.34802 |
| Fish_ce2 | 0 | Tennis_ce1 | 2.29508 |
| Football1 | 1.72043 | Tennis_ce2 | 16.17647 |
| Girl | 8.64198 | Tennis_ce3 | 0 |
| Girlmov | 6.02151 | TennisBall_ce | 4.86111 |
| Guitar_ce1 | 0 | Thunder_ce | 0 |
| Guitar_ce2 | 20.1278 | Tiger1 | 2.54237 |
| Gym | 0 | Tiger2 | 1.36986 |
| Hand | 30.32787 | Torus | 0 |
| Hand_ce1 | 0.74813 | Toyplane_ce | 21.7284 |
| Hand_ce2 | 0 | Trellis | 2.7957 |
| Hurdle_ce1 | 40 | Walking | 0 |
| Hurdle_ce2 | 0 | Walking2 | 3.87097 |
| Iceskater | 0 | Woman | 0 |
| Ironman | 3.61446 | Yo-yos_ce1 | 11.06383 |
| Jogging [1] | 24.42997 | Yo-yos_ce2 | 17.84141 |
| Jogging [2] | 25.40717 | Yo-yos_ce3 | 5.44554 |
| Juice | 0 | | |
| Kite_ce1 | 0 | | |
| Kite_ce2 | 0 | | |
| Mean | 10.92039 | | |

Tables 7-9 present the comparative results of mean accuracy across datasets. Table 7 shows the mean accuracy comparison for OTB-50, Table 8 for OTB-100, and Table 9 for TC-128.

Table 7. Result comparison of mean accuracy (OTB-50)

OTB50 - Mean Success and Precision in %

| Tracker | Success | Precision |
|---|---|---|
| SRDCF [25] | 0.539 | 0.731 |
| CFNet [26] | 0.535 | 0.724 |
| LMCF [27] | 0.533 | 0.729 |
| SiameseFC [28] | 0.519 | 0.693 |
| Staple [29] | 0.506 | 0.683 |
| LCT [30] | 0.488 | 0.689 |
| SAMF [31] | 0.464 | 0.649 |
| fDSST [32] | 0.406 | 0.616 |
| KCF [33] | 0.403 | 0.610 |
| SL [17] | 0.543 | 0.757 |
| Ours | 0.631 | 0.917 |

Table 8. Result comparison of mean accuracy (OTB-100)

OTB100 - Mean Success and Precision in %

| Tracker | Success | Precision |
|---|---|---|
| SRDCF [25] | 0.598 | 0.789 |
| CFNet [26] | 0.587 | 0.778 |
| LMCF [27] | 0.587 | 0.772 |
| SiamFC [28] | 0.578 | 0.784 |
| Staple [29] | 0.578 | 0.783 |
| LCT [30] | 0.558 | 0.761 |
| SAMF [31] | 0.548 | 0.750 |
| fDSST [32] | 0.517 | 0.686 |
| KCF [33] | 0.477 | 0.695 |
| SL [17] | 0.597 | 0.816 |
| Ours | 0.608 | 0.904 |

Table 9. Result comparison of mean accuracy (TC-128)

TC128 - Mean Success and Precision in %

| Tracker | Success | Precision |
|---|---|---|
| KCF [33] | 0.418 | 0.588 |
| Frag [34] | 0.408 | 0.538 |
| VTD [35] | 0.407 | 0.527 |
| MIL [36] | 0.393 | 0.539 |
| OAB [37] | 0.389 | 0.526 |
| SL [17] | 0.437 | 0.606 |
| Ours | 0.795 | 0.890 |

Figures 12 to 17 illustrate the performance evaluation plots for the three benchmark datasets. Specifically, Figures 12, 14, and 16 present the success plots for OTB-50, OTB-100, and TC-128 respectively, while Figures 13, 15, and 17 show the corresponding precision plots, providing a clear visual comparison of tracking accuracy and precision across these datasets.

Figure 12. Success plot of OTB-50

Figure 13. Precision plot of OTB-50

Figure 14. Success plot of OTB-100

Figure 15. Precision plot of OTB-100

Figure 16. Success plot of TC-128

Figure 17. Precision plot of TC-128

The evaluations are also compared under 11 attributes for OTB50, OTB100, and TempleColor128: deformations (DEF), illumination variations (IV), background clutter (BC), in-plane rotations (IPR), fast motions (FM), occlusions (OCC), out-of-plane rotations (OPR), motion blurs (MB), scale variations (SV), out of view (OV), and low resolutions (LR).

5. Conclusions and Future Scope

In difficult scenarios with deformations, lighting variations, background clutter, in-plane rotations, fast motions, occlusions, out-of-plane rotations, motion blurs, scale variations, out-of-view instances, and low resolutions, the paper focuses on improving the tracking performance of the combined Siamese Network and TensorFlow approach. When there are several peaks in the Siamese Network and TensorFlow response map, a more accurate detection network is used to pinpoint the object's position. Furthermore, the model retains accuracy in response to changes in the object's appearance. In addition, a high-confidence template update technique is applied to avoid template contamination. The suggested system outperforms the baseline Siamese Network and TensorFlow configurations in terms of tracking accuracy and success rates, according to an objective assessment using OTB and TempleColor sequences. Experiments on sample video sequences demonstrate better accuracy and resilience, especially in situations with varying lighting, fast motion, occlusion, and background clutter. Upcoming efforts will concentrate on improving the re-detection network and streamlining real-time performance to enable practical engineering use of Siamese Network based trackers. Although it performs competitively on common tracking benchmarks, the suggested hybrid model that combines Siamese Networks and TensorFlow has some drawbacks. Specifically, the model's performance may deteriorate in harsh settings like extended occlusion, extreme target deformation, or low light levels. Because of the limited appearance information, these situations frequently result in tracking drift or loss of target identity. The use of multi-modal data, adaptive appearance models, or the integration of temporal attention mechanisms could all be investigated in future research. Adding transformer-based tracking modules or reinforcement learning could also help with occlusion recovery and adaptability to changing scenes.

References

[1] Ondrašovič, M., Tarábek, P. (2021). Siamese visual object tracking: A survey. IEEE Access, 9: 110149-110172. https://doi.org/10.1109/ACCESS.2021.3101988

[2] Lee, D.H. (2019). One-shot scale and angle estimation for fast visual object tracking. IEEE Access, 7: 55477-55484. https://doi.org/10.1109/ACCESS.2019.2913390

[3] Yan, B., Peng, H., Fu, J., Wang, D., Lu, H. (2021). Learning spatio-temporal transformer for visual tracking. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, pp. 10448-10457. https://doi.org/10.1109/ICCV48922.2021.01028

[4] Husman, M.A., Albattah, W., Abidin, Z.Z., Mustafah, Y.M., Kadir, K., Habib, S., Islam, M., Khan, S. (2021). Unmanned aerial vehicles for crowd monitoring and analysis. Electronics, 10(23): 2974. https://doi.org/10.3390/electronics10232974

[5] Uke, N.J., Thool, R.C. (2014). Motion tracking system in video based on extensive feature set. The Imaging Science Journal, 62(2): 63-72. https://doi.org/10.1179/1743131X13Y.0000000052

[6] Zhong, W., Lu, H., Yang, M.H. (2012). Robust object tracking via sparsity-based collaborative model. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, USA, pp. 1838-1845. https://doi.org/10.1109/CVPR.2012.6247882

[7] Li, H., Wu, S., Huang, S., Lam, K.M., Xing, X. (2019). Deep motion-appearance convolutions for robust visual tracking. IEEE Access, 7: 180451-180466. https://doi.org/10.1109/ACCESS.2019.2958405

[8] Zheng, L., Tang, M., Wang, J. (2018). Learning robust gaussian process regression for visual tracking. In 2018 Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden, pp. 1219-1225. https://doi.org/10.24963/ijcai.2018/170

[9] Danelljan, M., Gool, L.V., Timofte, R. (2020). Probabilistic regression for visual tracking. In 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, pp. 7183-7192. https://doi.org/10.48550/arXiv.2003.12565

[10] Chen, K., Tao, W. (2018). Convolutional regression for visual tracking. IEEE TIP, 27(7): 3611-3620. https://doi.org/10.1109/TIP.2018.2819362

[11] Milan, A., Rezatofighi, S.H., Dick, A., Reid, I., Schindler, K. (2017). Online multi-target tracking using recurrent neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 31(1): 4225-4232. https://doi.org/10.1609/aaai.v31i1.11194 

[12] Yun, S., Choi, J., Yoo, Y., Yun, K., Choi, J.Y. (2018). Action-driven visual object tracking with deep reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 29(6): 2239-2252. https://doi.org/10.1109/TNNLS.2018.2801826

[13] Henriques, J.F., Caseiro, R., Martins, P., Batista, J. (2014). High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3): 583-596. https://doi.org/10.1109/TPAMI.2014.2345390

[14] Zheng, L., Tang, M., Wang, J. (2018). Learning robust gaussian process regression for visual tracking. In 2018 Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden, pp. 1219-1225. https://doi.org/10.24963/ijcai.2018/170

[15] Li, X., Liu, Q., Fan, N., Zhou, Z., He, Z., Jing, X.Y. (2020). Dual-regression model for visual tracking. Neural Networks, 132: 364-374. https://doi.org/10.1016/j.neunet.2020.09.011

[16] Zhang, B., Zhang, X., Qi, J. (2015). Support vector regression learning based uncalibrated visual servoing control for 3D motion tracking. In 2015 34th Chinese Control Conference (CCC), Hangzhou, China, pp. 8208-8213. https://doi.org/10.1109/ChiCC.2015.7260942

[17] Zhang, J.W., Wang, H., Zhang, H.L., Wang, J.C., Miao, M.E., Wang, J.D. (2022). Tracking method of online target-aware via shrinkage loss. International Journal of Innovative Computing, Information and Control, 18(5): 1395-1411. https://doi.org/10.24507/ijicic.18.05.1395

[18] Danelljan, M., Hager, G., Shahbaz Khan, F., Felsberg, M. (2015). Learning spatially regularized correlation filters for visual tracking. In 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, pp. 4310-4318. https://doi.org/10.1109/ICCV.2015.490

[19] Zheng, K., Zhang, Z., Qiu, C. (2022). A fast adaptive multi-scale kernel correlation filter tracker for rigid object. Sensors, 22(20): 7812. https://doi.org/10.3390/s22207812

[20] Yadav, S. (2021). Occlusion aware kernel correlation filter tracker using RGB-D. arXiv preprint arXiv:2105.12161. https://doi.org/10.48550/arXiv.2105.12161

[21] Maharani, D.A., Machbub, C., Yulianti, L., Rusmin, P.H. (2023). Deep features fusion for KCF-based moving object tracking. Journal of Big Data, 10(1): 136. https://doi.org/10.1186/s40537-023-00813-5

[22] Yang, J., Tang, W., Ding, Z. (2021). Long-term target tracking of UAVs based on kernelized correlation filter. Mathematics, 9(23): 3006. https://doi.org/10.3390/math9233006

[23] Oner, M.U., Kye-Jet, J.M.S., Lee, H.K., Sung, W.K. (2020). Studying the effect of mil pooling filters on mil tasks. arXiv preprint arXiv:2006.01561. https://arxiv.org/abs/2006.01561

[24] Xiong, D., Lu, H., Yu, Q., Xiao, J., Han, W., Zheng, Z. (2020). Parallel tracking and detection for long-term object tracking. International Journal of Advanced Robotic Systems, 17(2): 1729881420902577. https://doi.org/10.1177/1729881420902577

[25] Patil, R.R., Vaidya, O.S., Phade, G.M., Gandhe, S.T. (2020). Qualified scrutiny for real-time object tracking framework. International Journal on Emerging Technologies, 11(3): 313-319.  

[26] Valmadre, J., Bertinetto, L., Henriques, J., Vedaldi, A., Torr, P.H. (2017). End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 2805-2813. https://doi.org/10.1109/CVPR.2017.298

[27] Wang, M., Liu, Y., Huang, Z. (2017). Large margin object tracking with circulant feature maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 4021-4029. https://doi.org/10.1109/CVPR.2017.428

[28] Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H. (2016). Fully-convolutional siamese networks for object tracking. In 2016 European Conference on Computer Vision Workshops (ECCV 2016 Workshops), Amsterdam, Netherlands, pp. 850-865. https://doi.org/10.1007/978-3-319-48881-3_56

[29] Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., Torr, P.H. (2016). Staple: Complementary learners for real-time tracking. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, pp. 1401-1409. https://doi.org/10.1109/CVPR.2016.156

[30] Ma, C., Yang, X., Zhang, C., Yang, M.H. (2015). Long-term correlation tracking. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, USA, pp. 5388-5396. https://doi.org/10.1109/CVPR.2015.7299177

[31] Li, Y., Zhu, J. (2015). A scale adaptive kernel correlation filter tracker with feature integration. In 2016 European Conference on Computer Vision Workshops (ECCV 2016 Workshops), Amsterdam, the Netherlands, pp. 254-265. https://doi.org/10.1007/978-3-319-16181-5_18

[32] Danelljan, M., Häger, G., Khan, F., Felsberg, M. (2014). Accurate scale estimation for robust visual tracking. In 2014 British Machine Vision Conference (BMVC 2014), Nottingham, UK, pp. 1-12. https://doi.org/10.5244/C.28.65

[33] Henriques, J.F., Caseiro, R., Martins, P., Batista, J. (2014). High-speed tracking with kernelized correlation filters. TPAMI, 37(3): 583-596. https://doi.org/10.1109/TPAMI.2014.2345390

[34] Adam, A., Rivlin, E., Shimshoni, I. (2006). Robust fragments-based tracking using the integral histogram. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), New York, USA, pp. 798-805. https://doi.org/10.1109/CVPR.2006.256

[35] Kwon, J., Lee, K.M. (2010). Visual tracking decomposition. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, USA, pp. 1269-1276. https://doi.org/10.1109/CVPR.2010.5539821

[36] Babenko, B., Yang, M.H., Belongie, S. (2010). Robust object tracking with online multiple instance learning. TPAMI, 33(8): 1619-1632. https://doi.org/10.1109/TPAMI.2010.226

[37] Grabner, H., Grabner, M., Bischof, H. (2006). Real-time tracking via on-line boosting. BMVC Proceedings, 1(5): 6. https://doi.org/10.5244/C.20.6