Instance Segmentation and Fine-Grained Contour Extraction of Basketball Players from Videos under Complex Occlusion Scenarios

Delong Jia

Department of Physical Education, Shandong Technology and Business University, Yantai 264005, China

Corresponding Author Email: 19153509328@163.com

Received: 12 November 2025

Revised: 27 February 2026

Accepted: 9 March 2026

Available online: 30 April 2026


OPEN ACCESS

Abstract:

In basketball game videos, challenges such as frequent limb bending, dense multi-player occlusion, and complex court lighting conditions severely degrade the performance of conventional instance segmentation methods, often resulting in coarse mask boundaries, distorted contours in occluded regions, and insufficient representation accuracy. To address these issues, this paper proposes a unified framework that integrates dynamic deformable feature extraction, parametric shape encoding, and continuous distance field optimization for precise basketball player instance segmentation and contour refinement. First, a boundary-enhanced multi-scale feature extraction network based on Dynamic Snake Convolution (DSC) is constructed to accurately capture slender and curved limb structures while enabling bidirectional fusion of semantic and edge features. Second, departing from the traditional binary discrete mask prediction paradigm, a two-stage continuous contour prediction mechanism is introduced, which combines Elliptic Fourier Descriptors for global shape encoding with a Signed Distance Function (SDF) field for local boundary refinement, achieving compact shape representation and sub-pixel-level contour restoration. Furthermore, a depth-hierarchy-aware copy-paste data augmentation strategy with spatial constraints is designed to simulate realistic occlusion scenarios, and a level-set evolution-based active contour algorithm is incorporated for post-processing refinement in heavily occluded areas. Finally, a multi-task joint weighted loss function is formulated to enable collaborative optimization across multiple branches. Extensive experiments conducted on both a self-constructed basketball occlusion dataset and publicly available sports vision benchmarks demonstrate that the proposed method significantly outperforms state-of-the-art instance segmentation approaches in terms of segmentation accuracy and contour fitting quality under severe occlusion conditions. 
The proposed approach effectively accomplishes accurate and smooth player instance segmentation, providing reliable visual technical support for practical applications such as intelligent tactical analysis, quantitative sports pose assessment, and athlete behavior recognition.

Keywords:

basketball video image segmentation, complex occlusion, Dynamic Snake Convolution, Elliptic Fourier Descriptors, Signed Distance Function, contour refinement, multi-scale feature fusion

1. Introduction

With the rapid iteration of computer vision technologies [1] and their deep integration with the smart sports industry, intelligent sports event analysis has become a core driving force for the digital transformation of the sports industry [2, 3]. Basketball is one of the most popular competitive sports worldwide, and its game analysis is shifting from traditional manual statistics toward intelligent, fine-grained analysis, which makes precise visual perception of athletes increasingly important. However, the dynamic and complex environment of basketball courts poses several difficult technical challenges for athlete instance segmentation and contour extraction [4, 5]. First, athletes' limbs are slender, curved, and non-rigid, and actions such as torso twisting and limb stretching continually change the target's shape. Second, in high-intensity confrontation scenarios, densely packed players produce large-scale limb cross-occlusion, and some regions are completely occluded, leaving target features incomplete. Third, fluctuations in court lighting, motion artifacts, and blur caused by fast motion further degrade image quality. Finally, athletes, referees, and coaches appear at markedly different scales and move rapidly, which makes feature extraction and contour localization even harder. As the underlying core technology of intelligent basketball game analysis, instance segmentation and precise contour extraction directly determine the accuracy and reliability of upper-level applications such as player trajectory tracking, quantitative action scoring, tactical decomposition, and intelligent event commentary. 
Therefore, conducting research on athlete instance segmentation and contour refinement for complex occlusion scenarios in basketball game videos holds significant practical necessity [6-8]. From a theoretical perspective, this study can improve the theoretical system of continuous contour representation for non-rigid human targets under complex dynamic occlusion scenarios, break through the accuracy bottleneck of traditional discrete mask segmentation, and establish a new visual segmentation paradigm integrating deformable convolution feature extraction and parametric shape priors [9, 10], enriching the research content and technical paths in the field of sports-specific visual image processing. From an application perspective, the research results can be directly implemented in professional basketball game intelligent analysis systems [11], providing coaching teams with accurate tactical optimization bases; they can be applied to campus basketball teaching correction [12], assisting students in standardizing movement postures; they can also be extended to scenarios such as public fitness posture assessment and sports big data intelligent analysis [13, 14], possessing extremely high engineering implementation value and industrial promotion potential.

Currently, scholars at home and abroad have conducted extensive research on instance segmentation, sports human segmentation, contour extraction, and occlusion optimization, forming a series of research results. However, specific research targeting complex occlusion scenarios in basketball game videos still suffers from many common defects, making it difficult to meet actual application demands. Mainstream instance segmentation models rely on pixel-level binary mask output; their inherent discrete nature leads to obvious stepped aliasing on segmentation edges, making sub-pixel level contour extraction impossible [15, 16]. In areas where athletes' limbs cross and occlude, pixel misclassification is prone to occur, leading to contour boundary distortion. Traditional deformable convolutions adopt an independent sampling point offset mechanism lacking continuous path constraints, making them unsuitable for adapting to the complex geometric shapes of basketball players' limb bending and torso twisting. Their integrity in extracting long-distance limb features is insufficient, leading to feature fracture problems. Existing contour extraction methods mostly focus on local edge detection [17, 18], lacking global human shape prior constraints. In areas where occlusion leads to missing contours, contour deformation and limb fracture are highly likely to occur, making it impossible to achieve accurate restoration of complete contours. General image data augmentation strategies do not consider the spatial depth logic of the court and cannot simulate the real physical occlusion relationship of "near occluding far." Synthetic training samples differ greatly from actual court scenes, resulting in insufficient generalization capability of models in complex occlusion scenarios. 
Furthermore, existing segmentation frameworks lack dedicated loss constraints for sports human targets [19]; the optimization weight distribution for edge features and shape features is unreasonable, and the collaborative fitting effect of multi-task branches is poor, making it difficult to balance segmentation accuracy and contour refinement [20]. The existence of these problems severely restricts the application and promotion of athlete instance segmentation and contour extraction technologies in intelligent basketball game analysis, necessitating the proposal of a targeted solution.

To address these deficiencies, this paper studies athlete instance segmentation and contour refinement for complex occlusion scenarios in basketball game videos. The core contributions are as follows: (1) a boundary-enhanced multi-scale Dynamic Snake Convolution (DSC) feature extraction backbone network that introduces cumulative offset constraints to match athlete limb features and deeply fuses semantic and edge features; (2) a dual-stage continuous contour prediction framework that combines Elliptic Fourier global shape encoding with Signed Distance Function (SDF) local refinement to achieve compact contour representation and contour completion in occluded areas; (3) a position-aware, depth-constrained copy-paste data augmentation method that improves the model's robustness to occlusion; (4) a multi-constraint fusion active contour post-processing mechanism that further refines contours; and (5) a multi-task joint weighted loss function that enables collaborative optimization of all branch tasks.

The remainder of this paper is organized as follows. Section 2 describes the proposed athlete instance segmentation and contour refinement method, including the design of the feature extraction network, the dual-stage contour prediction mechanism, the occlusion-aware training strategy, and the multi-task joint optimization loss function. Section 3 verifies the effectiveness and superiority of the proposed method through comparative and ablation experiments and analyzes the experimental results in detail. Section 4 summarizes the research, analyzes its limitations, and outlines future research directions.

2. Athlete Segmentation and Contour Refinement Method for Basketball Occlusion Scenarios

2.1 Boundary-enhanced multi-scale Dynamic Snake Convolution feature extraction backbone network

Guided by the non-rigid deformation and dense occlusion that characterize athletes on basketball courts, this section constructs a feature extraction backbone adapted to human limb morphology. The network uses ResNet-50 and ResNet-101 (residual networks with 50 and 101 layers) as base encoders and replaces all standard convolutions in the conv2_x to conv5_x stages with DSC, so that the feature sampling scheme changes throughout the network. The overall architecture follows a hierarchical design of basic feature extraction, multi-scale feature fusion, and edge perception enhancement: the residual structure maps shallow image textures to deep semantics layer by layer, and a Feature Pyramid Network (FPN) builds a multi-resolution feature representation. The network has four feature output levels, corresponding to feature maps with different downsampling rates. Each level performs feature encoding and information interaction independently and adds a bypass branch, yielding parallel semantic and edge representations. The model takes game video frames at a resolution of 800×1280 as input, and the backbone outputs multi-scale feature maps that carry both semantic information and edge detail, which serve as the input to the subsequent instance localization and contour fitting tasks. Figure 1 shows the overall framework for athlete segmentation and contour extraction in basketball occlusion scenarios.
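The cumulative offset constraint that distinguishes DSC from ordinary deformable convolution can be illustrated with a small sketch. It is not the paper's layer: the offsets here are plain numbers rather than network predictions, the kernel is laid out along the x-axis only, and the clamp range of [-1, 1] per step and the function name `snake_kernel_coords` are assumptions made for illustration.

```python
from itertools import accumulate


def _clamp(value, low=-1.0, high=1.0):
    """Limit a single step offset (assumed range) so neighbours stay close."""
    return max(low, min(high, value))


def snake_kernel_coords(center, deltas_left, deltas_right):
    """Return sampling coordinates for a 1-D kernel laid out along the x-axis.

    center       -- (x, y) position of the kernel centre
    deltas_left  -- per-step vertical offsets, ordered from the centre outward
    deltas_right -- same, for the right-hand side of the kernel
    """
    cx, cy = center
    # Cumulative sums: each sample's offset builds on its inner neighbour,
    # so the sampling path stays connected instead of scattering.
    left = list(accumulate(_clamp(d) for d in deltas_left))
    right = list(accumulate(_clamp(d) for d in deltas_right))

    coords = [(cx - step, cy + dy) for step, dy in enumerate(left, start=1)]
    coords.reverse()  # leftmost sample first
    coords.append((cx, cy))
    coords += [(cx + step, cy + dy) for step, dy in enumerate(right, start=1)]
    return coords


if __name__ == "__main__":
    for point in snake_kernel_coords((10, 5), [0.5, 0.5], [1.0, -0.5, 2.0]):
        print(point)
```

Because each vertical offset is accumulated from the centre outward, adjacent samples never differ by more than one pixel vertically, which is the continuity property that independent per-point offsets do not guarantee.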

3. Experimental Results and Performance Analysis

3.1 Overall experimental setup

The experiment takes athlete instance segmentation and contour refinement under complex occlusion scenarios in basketball game videos as the core objective, verifying the effectiveness and superiority of the proposed method through joint datasets. The dataset consists of a self-built basketball game complex occlusion dataset and public sports human vision datasets. The self-built dataset contains 5000 basketball game images, covering different lighting intensities, occlusion degrees, and athlete motion postures, with each image finely annotated for athlete instances, segmentation masks, and SDF ground truth. The public datasets selected are COCO-Sports and Sports-1M, filtering relevant samples of basketball events and supplementing contour annotations to ensure data diversity and annotation consistency. The dataset is divided into training, validation, and test sets at a ratio of 8:1:1. During the training process, online data augmentation is used to generate synthetic occlusion samples to improve the model's generalization ability.
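The paper does not state how the SDF ground truth is derived from the segmentation masks. One common construction, shown below as a brute-force sketch for small masks, takes the distance from each pixel to the nearest pixel of the opposite class and assigns a sign. The sign convention (negative inside the athlete, positive outside) and the function name `signed_distance_field` are assumptions.

```python
import math


def signed_distance_field(mask):
    """Brute-force SDF of a small binary mask given as a list of 0/1 rows.

    Assumed convention: negative inside the instance, positive outside.
    """
    height, width = len(mask), len(mask[0])
    cells = [(r, c) for r in range(height) for c in range(width)]
    inside = [(r, c) for r, c in cells if mask[r][c]]
    outside = [(r, c) for r, c in cells if not mask[r][c]]
    if not inside or not outside:
        raise ValueError("mask needs both foreground and background pixels")

    sdf = [[0.0] * width for _ in range(height)]
    for r, c in cells:
        # Distance to the nearest pixel of the opposite class.
        targets = outside if mask[r][c] else inside
        dist = min(math.hypot(r - tr, c - tc) for tr, tc in targets)
        sdf[r][c] = -dist if mask[r][c] else dist
    return sdf


if __name__ == "__main__":
    demo = [[0] * 5] + [[0, 1, 1, 1, 0] for _ in range(3)] + [[0] * 5]
    for row in signed_distance_field(demo):
        print(" ".join(f"{v:+.2f}" for v in row))
```

The quadratic cost makes this suitable only for illustration; for full 800×1280 frames a Euclidean distance transform would normally be used instead.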

Three groups of evaluation metrics are adopted to quantify model performance. Instance segmentation metrics comprise Average Precision at IoU threshold 0.5 (AP50), Average Precision at IoU threshold 0.75 (AP75), and mean Intersection over Union (mIoU); AP50 and AP75 measure segmentation accuracy at the two IoU thresholds, and mIoU measures overall pixel-level agreement. Contour refinement metrics comprise the Average Boundary Distance Error and the Sub-pixel Contour Fitting Error, which quantify the accuracy and smoothness of contour extraction. Model efficiency is measured by single-image inference speed in Frames Per Second (FPS). All experiments share the same software and hardware environment: a single NVIDIA A100 GPU with 32GB memory, and the PyTorch 1.12 deep learning framework with Python 3.8, OpenCV 4.5, and CUDA 11.6, to support the reproducibility of the experimental results.
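The following sketch shows how the basic quantities behind these metrics can be computed. Mask IoU and the latency-to-FPS conversion are standard; the boundary distance is written here as a symmetric mean nearest-point distance between two contours, which is an assumed definition because the paper does not give its exact formula.

```python
import math


def mask_iou(pred, gt):
    """Intersection over Union of two binary masks (lists of 0/1 rows)."""
    inter = union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 1.0


def mean_boundary_distance(contour_a, contour_b):
    """Symmetric mean nearest-point distance between two contours (assumed)."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(contour_a, contour_b) + one_way(contour_b, contour_a))


def fps_from_latency(ms_per_frame):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


if __name__ == "__main__":
    pred = [[1, 1, 0], [1, 1, 0]]
    gt = [[0, 1, 1], [0, 1, 1]]
    iou = mask_iou(pred, gt)
    # A prediction counts toward AP50 only if its IoU reaches 0.5.
    print(f"IoU = {iou:.3f}, counts at the 0.5 threshold: {iou >= 0.5}")
    print(f"FPS at 44.5 ms/frame = {fps_from_latency(44.5):.1f}")
```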

3.2 Core ablation experiments

To verify the effectiveness of each core module of the proposed method, 5 groups of ablation experiments are designed. The settings and quantitative results of each group are as follows. All data are average results on the test set, where the occlusion degree is divided into mild (occlusion ratio ≤ 30%), moderate (30% < occlusion ratio ≤ 60%), and severe (occlusion ratio > 60%).
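The three occlusion levels can be expressed as a simple binning rule. The thresholds below come from the text; the helper `occlusion_ratio`, which derives the ratio from visible and full (unoccluded) pixel counts, is a hypothetical definition, since the paper does not state how the ratio is measured.

```python
def occlusion_ratio(visible_pixels, full_pixels):
    """Fraction of an instance that is hidden (hypothetical definition)."""
    if full_pixels <= 0 or not 0 <= visible_pixels <= full_pixels:
        raise ValueError("expected 0 <= visible_pixels <= full_pixels, full_pixels > 0")
    return (full_pixels - visible_pixels) / full_pixels


def occlusion_level(ratio):
    """Map an occlusion ratio in [0, 1] to the three levels used in the tables."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if ratio <= 0.30:
        return "mild"
    if ratio <= 0.60:
        return "moderate"
    return "severe"


if __name__ == "__main__":
    for visible in (90, 55, 20):
        ratio = occlusion_ratio(visible, 100)
        print(f"{ratio:.2f} -> {occlusion_level(ratio)}")
```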

3.2.1 Feature extraction backbone network ablation experiment

Experiments compare the performance of different feature extraction backbones to verify the role of DSC and the Semantic-Edge Fusion module. The experimental results are shown in Table 1.

Table 1 shows that as the occlusion degree increases, the segmentation accuracy of every backbone network declines and contour extraction errors grow, while the proposed complete feature backbone performs best at all occlusion degrees. Compared with the native ResNet-50 in severe occlusion scenarios, the proposed backbone improves AP50 by 9.5 percentage points and mIoU by 10.1 percentage points and reduces the average boundary distance error by 1.12 pixels, indicating that DSC effectively captures the bent limb features of athletes and alleviates the sampling fracture problem of traditional convolution. Compared with the traditional deformable convolution backbone in moderate occlusion scenarios, the proposed backbone further improves edge feature extraction through semantic-edge bidirectional fusion, with AP75 increasing by 5.6 percentage points and the sub-pixel contour fitting error decreasing by 0.25 pixels. The multi-scale network without an edge branch performs slightly better than the traditional deformable convolution backbone but worse than the proposed complete backbone, verifying the role of the edge perception branch in optimizing boundary detail. Although the FPS of the proposed backbone decreases slightly because of the DSC and edge fusion modules, it remains above 25 FPS, meeting real-time inference requirements.

Table 1. Ablation experiment results of feature extraction backbone networks

Backbone Network Type	Occlusion Degree	AP₅₀ (%)	AP₇₅ (%)	mIoU (%)	Average Boundary Distance Error (pixels)	Sub-pixel Contour Fitting Error (pixels)	FPS (frames/s)
Native Residual Network with 50 Layers (ResNet-50)	Mild	82.5	71.3	78.6	2.13	0.89	28.7
	Moderate	75.8	63.2	71.5	2.87	1.24	28.5
	Severe	68.3	54.7	64.2	3.79	1.68	28.6
Traditional Deformable Convolution Backbone	Mild	85.3	74.8	81.4	1.87	0.76	26.3
	Moderate	79.6	67.5	75.2	2.51	1.08	26.1
	Severe	73.2	59.4	68.7	3.32	1.45	26.2
Multi-scale Network without Edge Branch	Mild	86.7	76.2	82.8	1.72	0.71	27.5
	Moderate	81.2	69.3	77.1	2.35	1.01	27.3
	Severe	75.1	61.8	70.9	3.08	1.32	27.4
Proposed Complete Feature Backbone	Mild	89.7	79.5	85.9	1.38	0.57	25.8
	Moderate	84.5	73.1	80.6	1.96	0.83	25.6
	Severe	77.8	65.2	74.3	2.67	1.12	25.7

3.2.2 Dual-stage contour representation framework ablation experiment

Experiments compare the performance of different contour representation methods to verify the necessity of the dual-stage framework combining Elliptic Fourier global encoding and SDF local refinement. The experimental results are shown in Table 2.

Table 2. Ablation experiment results of the dual-stage contour representation framework

Contour Representation Method	Occlusion Degree	AP₅₀ (%)	AP₇₅ (%)	mIoU (%)	Average Boundary Distance Error (pixels)	Sub-pixel Contour Fitting Error (pixels)	Contour Distortion Rate (%)
Traditional Binary Mask Prediction	Mild	85.1	74.6	81.2	1.89	0.92	3.7
	Moderate	78.9	66.8	74.7	2.63	1.35	7.2
	Severe	71.5	58.3	67.9	3.58	1.76	12.5
Single Signed Distance Function (SDF) Field Prediction	Mild	86.5	76.3	82.7	1.64	0.78	2.9
	Moderate	80.3	68.7	76.5	2.37	1.14	5.8
	Severe	73.8	60.5	69.8	3.12	1.48	9.6
Elliptic Fourier Shape Prediction Only	Mild	84.3	73.5	80.1	1.96	0.87	4.1
	Moderate	77.6	65.4	73.2	2.71	1.28	7.9
	Severe	70.2	57.1	66.5	3.65	1.69	13.2
Proposed Dual-Stage Continuous Contour Prediction	Mild	89.7	79.5	85.9	1.38	0.57	1.8
	Moderate	84.5	73.1	80.6	1.96	0.83	4.3
	Severe	77.8	65.2	74.3	2.67	1.12	7.8

Table 2 shows that the proposed dual-stage continuous contour prediction framework outperforms the other three contour representation methods on every indicator, most clearly in severe occlusion scenarios. Because of its discrete nature, traditional binary mask prediction yields rough contour edges and larger boundary errors, and its contour distortion rate reaches 12.5% under severe occlusion. Single SDF field prediction achieves continuous contour extraction but lacks global shape constraints, making it prone to contour offset in occluded areas; its contour distortion rate still reaches 9.6% under severe occlusion. Elliptic Fourier shape prediction alone captures the global topological structure, but its local detail fitting accuracy is insufficient, so its boundary errors and distortion rates are higher than those of single SDF field prediction. The proposed dual-stage framework combines global shape encoding with local distance field refinement. In severe occlusion scenarios, compared with traditional binary mask prediction, it reduces the average boundary distance error by 0.91 pixels, the sub-pixel contour fitting error by 0.64 pixels, and the contour distortion rate by 4.7 percentage points. This verifies that the dual-stage continuous representation addresses both the limited accuracy of discrete masks and the poor robustness of a single continuous representation, achieving collaborative optimization of global structure and local detail.
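To make the global shape encoding concrete, the sketch below computes Elliptic Fourier coefficients of a closed polygonal contour using the standard Kuhl-Giardina formulation. It illustrates the descriptor itself, not necessarily the exact encoding the network regresses; the function name and the choice of harmonic order are assumptions.

```python
import math


def elliptic_fourier_coeffs(points, order):
    """Elliptic Fourier coefficients (a_n, b_n, c_n, d_n) for n = 1..order.

    points -- (x, y) vertices of a closed contour, first vertex not repeated.
    """
    count = len(points)
    # Segment displacements, including the closing segment back to vertex 0.
    dx = [points[(i + 1) % count][0] - points[i][0] for i in range(count)]
    dy = [points[(i + 1) % count][1] - points[i][1] for i in range(count)]
    dt = [math.hypot(u, v) for u, v in zip(dx, dy)]
    if any(seg == 0.0 for seg in dt):
        raise ValueError("consecutive duplicate vertices are not allowed")

    # Cumulative arc length at each vertex; t[-1] is the perimeter.
    t = [0.0]
    for seg in dt:
        t.append(t[-1] + seg)
    perimeter = t[-1]

    coeffs = []
    for n in range(1, order + 1):
        k = 2.0 * n * math.pi / perimeter
        scale = perimeter / (2.0 * n * n * math.pi ** 2)
        a = b = c = d = 0.0
        for i in range(count):
            dcos = math.cos(k * t[i + 1]) - math.cos(k * t[i])
            dsin = math.sin(k * t[i + 1]) - math.sin(k * t[i])
            a += dx[i] / dt[i] * dcos
            b += dx[i] / dt[i] * dsin
            c += dy[i] / dt[i] * dcos
            d += dy[i] / dt[i] * dsin
        coeffs.append((scale * a, scale * b, scale * c, scale * d))
    return coeffs


if __name__ == "__main__":
    circle = [(2 * math.cos(2 * math.pi * i / 360),
               2 * math.sin(2 * math.pi * i / 360)) for i in range(360)]
    for n, harmonic in enumerate(elliptic_fourier_coeffs(circle, 3), start=1):
        print(n, [round(v, 4) for v in harmonic])
```

Each harmonic contributes only four numbers, so a contour truncated at a modest order is a compact, smooth description of the overall shape. Fine boundary detail is lost in that truncation, which is consistent with the role the paper assigns to the SDF refinement stage.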

3.2.3 Comparative experiment of occlusion-aware data augmentation strategies

Experiments compare the impact of different data augmentation strategies on model performance to verify the effectiveness of the position-aware depth-constrained copy-paste augmentation method. The experimental results are shown in Table 3.

Table 3. Comparative experimental results of occlusion-aware data augmentation strategies

Data Augmentation Strategy	Occlusion Degree	AP₅₀ (%)	AP₇₅ (%)	mIoU (%)	Average Boundary Distance Error (pixels)	Sub-pixel Contour Fitting Error (pixels)
No Data Augmentation	Mild	86.2	75.8	82.3	1.67	0.74
	Moderate	79.8	68.4	76.1	2.41	1.09
	Severe	72.1	59.7	68.5	3.25	1.49
Conventional Geometric Transformation Augmentation	Mild	87.5	77.1	83.6	1.53	0.68
	Moderate	81.5	70.2	77.8	2.24	1.01
	Severe	74.3	61.9	70.7	3.01	1.36
Random Copy-Paste Augmentation	Mild	88.3	78.0	84.5	1.45	0.63
	Moderate	82.7	71.5	79.1	2.12	0.94
	Severe	75.6	63.2	72.1	2.85	1.27
Proposed Depth-Hierarchy Occlusion Augmentation	Mild	89.7	79.5	85.9	1.38	0.57
	Moderate	84.5	73.1	80.6	1.96	0.83
	Severe	77.8	65.2	74.3	2.67	1.12

It can be seen from Table 3 that various data augmentation strategies can improve model performance, among which the proposed depth-hierarchy occlusion augmentation strategy achieves the best results. Without any data augmentation strategy, the model performs worst in severe occlusion scenarios, with AP50 being only 72.1%, indicating that insufficient occlusion scenarios in training samples will lead to weak model generalization ability. Conventional geometric transformation augmentation can only improve the model's adaptability to posture and scale changes, with limited optimization effects on occlusion scenarios; although random copy-paste augmentation can increase the number of occlusion samples, it lacks depth constraints, causing synthetic samples to not conform to real occlusion logic, resulting in only a 3.5 percentage point increase in AP50 under severe occlusion scenarios. The proposed depth-hierarchy occlusion augmentation generates synthetic samples that conform to the physical law of "near occluding far" through depth estimation and geometric correction. In severe occlusion scenarios, compared with no data augmentation, AP50 increases by 5.7 percentage points, mIoU increases by 5.8 percentage points, and the average boundary distance error decreases by 0.58 pixels, significantly improving the segmentation accuracy and contour extraction capability of the model in dense mutual occlusion scenarios, verifying the rationality and effectiveness of this augmentation strategy.
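The "near occluding far" rule can be sketched as a compositing order: instances are painted from farthest to nearest, so a nearer player overwrites a farther one wherever they overlap. This is a simplified illustration, not the paper's pipeline; the `depth` value attached to each instance stands in for the output of the depth estimation step, and the dictionary layout and function names are hypothetical.

```python
def composite_by_depth(height, width, instances):
    """Build an instance-id map in which nearer instances occlude farther ones.

    instances -- dicts with 'id' (int > 0), 'pixels' (set of (row, col)) and
                 'depth' (float, smaller means nearer to the camera).
    """
    id_map = [[0] * width for _ in range(height)]
    # Paint far-to-near so nearer players overwrite farther ones.
    for inst in sorted(instances, key=lambda item: item["depth"], reverse=True):
        for r, c in inst["pixels"]:
            if 0 <= r < height and 0 <= c < width:
                id_map[r][c] = inst["id"]
    return id_map


def visible_ratio(id_map, inst):
    """Fraction of an instance's pixels still visible after compositing."""
    height, width = len(id_map), len(id_map[0])
    visible = sum(
        1 for r, c in inst["pixels"]
        if 0 <= r < height and 0 <= c < width and id_map[r][c] == inst["id"]
    )
    return visible / len(inst["pixels"])


if __name__ == "__main__":
    far = {"id": 1, "depth": 10.0, "pixels": {(0, 0), (0, 1), (0, 2), (0, 3)}}
    near = {"id": 2, "depth": 3.0, "pixels": {(0, 2), (0, 3), (0, 4)}}
    id_map = composite_by_depth(1, 5, [near, far])
    print(id_map, visible_ratio(id_map, far), visible_ratio(id_map, near))
```

Random copy-paste corresponds to painting in arbitrary order, which can place a distant player in front of a nearby one; sorting by depth removes that inconsistency in the synthetic samples.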

3.2.4 Effectiveness verification experiment of active contour post-processing module

Two sets of controlled experiments with and without post-processing were set up to quantify the optimization effect of the level set evolution active contour post-processing module on contour refinement extraction. The experimental results are shown in Table 4.

Table 4. Effectiveness verification experimental results of the active contour post-processing module

Experimental Setting	Occlusion Degree	Average Boundary Distance Error (pixels)	Sub-pixel Contour Fitting Error (pixels)	Contour Smoothness (pixel⁻¹)	Inference Time (ms/frame)	FPS (frames/s)
Without Post-Processing	Mild	1.62	0.71	0.38	38.2	26.2
	Moderate	2.28	0.98	0.31	38.1	26.3
	Severe	3.05	1.35	0.25	38.3	26.1
With Post-Processing	Mild	1.38	0.57	0.49	44.5	22.5
	Moderate	1.96	0.83	0.42	44.7	22.4
	Severe	2.67	1.12	0.36	44.6	22.4

Table 4 shows that introducing the active contour post-processing module clearly improves the contour refinement indicators while keeping inference cost within a reasonable range. Without post-processing, predicted contours are prone to aliasing and local concavity-convexity in severely occluded areas, where contour smoothness is only 0.25 and the average boundary distance error reaches 3.05 pixels. With post-processing, energy function constraints and level set evolution raise contour smoothness under severe occlusion by 44% (from 0.25 to 0.36) and reduce the average boundary distance error and sub-pixel contour fitting error by 0.38 pixels and 0.23 pixels, respectively; these absolute error reductions are the largest among the three occlusion levels. The post-processing module adds about 6.3 ms per frame and lowers the speed to about 22.5 FPS, so the method remains close to real-time rates for basketball game video analysis. Block-wise evolution optimization keeps this additional time cost under control, balancing accuracy and efficiency.

To verify the instance boundary recovery capability of the proposed method when facing complex occlusion, non-rigid posture changes, and lighting disturbances in basketball game videos, this paper conducts a visual analysis of the step-by-step processing results in typical scenarios. As can be seen from the results in Figure 5, the original input contains interference factors such as large-area overlapping of player bodies, slender limb bending caused by layup movements, and edge weakening caused by strong light shadows. Traditional segmentation results based on local texture or discrete masks are prone to adhesion, boundary fracture, and local contour distortion at occlusion junctions. In contrast, the proposed method enhances continuous responses along the bending direction of limbs through DSC in the feature extraction stage, forming more stable boundary high-response regions after semantic-edge fusion. In the dual-stage continuous contour prediction stage, global shape encoding provides reasonable human topological constraints for occluded parts, and local distance field regression further ensures the continuity and smoothness of boundary transitions. After active contour post-processing, the aliasing, protrusions, and unnatural depressions in the initial zero level set are significantly corrected. The final segmentation mask and bright green fine contour can still remain clearly separated in multi-player close-range areas. Detailed magnified results further show that contours relying solely on SDF are prone to inward or outward concavity in occluded areas, while after introducing Fourier shape priors, the contour can be smoothly completed along the natural structure of the human body.

Figure 5. Visual comparison chart of processing effects at each stage of the proposed method

3.2.5 Horizontal comparison experiment with mainstream advanced algorithms

Current mainstream instance segmentation and contour extraction algorithms were selected for comparison. Comprehensive performance comparisons were completed on a unified test set to verify the comprehensive superiority of the proposed method. The experimental results are shown in Table 5.

Table 5. Horizontal comparison experimental results of mainstream advanced algorithms

Algorithm Type	Occlusion Degree	AP₅₀ (%)	AP₇₅ (%)	mIoU (%)	Average Boundary Distance Error (pixels)	Sub-pixel Contour Fitting Error (pixels)
Mask R-CNN (Mask Region-based Convolutional Neural Network)	Mild	84.7	73.9	80.5	1.92	0.95
	Moderate	77.5	65.6	73.0	2.68	1.38
	Severe	70.1	56.8	66.2	3.62	1.81
Sparse Instance Activation for Real-Time Instance Segmentation (SparseInst)	Mild	86.3	75.7	82.2	1.71	0.82
	Moderate	79.6	68.3	75.9	2.43	1.17
	Severe	72.8	59.9	68.9	3.28	1.53
Human Contour Transformer	Mild	87.6	77.4	83.8	1.56	0.73
	Moderate	81.2	70.5	77.6	2.25	1.02
	Severe	74.9	62.7	71.5	2.96	1.31
Generic Signed Distance Function (SDF) Segmentation Algorithm	Mild	86.8	76.5	82.9	1.65	0.79
	Moderate	80.1	68.9	76.7	2.38	1.13
	Severe	73.5	61.2	70.2	3.15	1.46
Proposed Method	Mild	89.7	79.5	85.9	1.38	0.57
	Moderate	84.5	73.1	80.6	1.96	0.83
	Severe	77.8	65.2	74.3	2.67	1.12

Table 5 shows that the proposed method achieves the best accuracy among all compared algorithms, with the clearest advantage in severe occlusion scenarios. Mask R-CNN (Mask Region-based Convolutional Neural Network), a traditional instance segmentation algorithm, relies on binary mask output, which limits its contour extraction accuracy: its AP50 is only 70.1% under severe occlusion, and its boundary errors are large. Sparse Instance Activation for Real-Time Instance Segmentation (SparseInst) has the fastest inference speed but lacks sufficient segmentation accuracy and contour refinement in severe occlusion scenarios. The Human Contour Transformer focuses on contour extraction but has no design specific to occlusion, resulting in a sub-pixel contour fitting error of 1.31 pixels under severe occlusion. The generic SDF segmentation algorithm achieves continuous contour extraction but lacks dedicated shape prior constraints for basketball players, which limits its performance gain. Through the collaborative optimization of its core modules, the proposed method achieves AP50 of 89.7%, 84.5%, and 77.8% in mild, moderate, and severe occlusion scenarios, respectively, which are 5.0, 7.0, and 7.7 percentage points higher than Mask R-CNN. Under severe occlusion, compared with the generic SDF segmentation algorithm, the average boundary distance error is reduced by 0.48 pixels and the sub-pixel contour fitting error by 0.34 pixels. The proposed method runs somewhat slower than Mask R-CNN and SparseInst (inference speeds are not listed in Table 5), but it offers the best balance between accuracy and efficiency among the compared methods, supporting its applicability to complex basketball occlusion scenarios.

3.3 Model inference efficiency and engineering practicality analysis

To assess the engineering value of the proposed method, the inference time of each functional module of the model was measured separately. The results are shown in Figure 6.

Figure 6. Breakdown of inference time by module

Figure 6 shows that the feature extraction backbone network and the dual-stage contour prediction network are the main sources of inference time, accounting for 42.0% and 34.4% of the total, respectively. The active contour post-processing module accounts for 14.2%, kept within a reasonable range by block-wise evolution optimization, and the remaining modules account for only 9.4%, with little impact on overall inference efficiency. The overall single-frame inference time is 44.5 ms, corresponding to about 22.5 FPS, which is close to but slightly below the roughly 25 FPS usually expected for real-time intelligent analysis of basketball game videos.

Analysis of the model lightweight transformation space indicates that the feature extraction backbone network can further compress parameters through pruning, quantization, and other methods; the DSC module can adopt a lightweight separable convolution structure; the dual-stage contour prediction network can optimize the number of Transformer decoder layers and attention heads. It is estimated that the total inference time consumption can be reduced by more than 20%, and the FPS can be increased to more than 28 FPS, further enhancing engineering practicality.
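The timing figures above can be checked with a few lines of arithmetic. The sketch below converts the reported module shares into per-module latency and evaluates the estimated 20% reduction; the dictionary keys and function names are labels chosen for illustration, while the numbers are those reported in the text.

```python
# Shares of total inference time reported for Figure 6.
MODULE_SHARE = {
    "backbone": 0.420,
    "contour_prediction": 0.344,
    "active_contour": 0.142,
    "other": 0.094,
}
TOTAL_MS = 44.5  # reported single-frame inference time


def module_latency_ms(total_ms, shares):
    """Split a total latency into per-module latencies."""
    return {name: total_ms * share for name, share in shares.items()}


def fps(ms_per_frame):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / ms_per_frame


def latency_after_reduction(total_ms, reduction):
    """Latency after cutting the total by a fraction (0.20 means 20%)."""
    return total_ms * (1.0 - reduction)


if __name__ == "__main__":
    for name, ms in module_latency_ms(TOTAL_MS, MODULE_SHARE).items():
        print(f"{name:>20s}: {ms:5.2f} ms")
    print(f"current speed: {fps(TOTAL_MS):.1f} FPS")
    reduced = latency_after_reduction(TOTAL_MS, 0.20)
    print(f"after a 20% reduction: {reduced:.1f} ms, {fps(reduced):.1f} FPS")
```

A 20% reduction brings the latency to 35.6 ms per frame, which corresponds to just over 28 FPS and agrees with the estimate given above.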

In summary, while maintaining high-precision segmentation and contour refinement, the proposed method runs at near real-time speed and leaves considerable room for lightweight optimization. It is therefore suited to practical engineering applications such as intelligent analysis of professional basketball games and sports posture assessment, and has high potential for industrial adoption.

4. Conclusion and Future Work

This paper addresses athlete instance segmentation and contour refinement in basketball game videos under complex occlusion, through both theoretical analysis and experimental verification. To handle the inherent difficulties of limb bending deformation, dense mutual occlusion among players, and light and shadow interference on the court, a complete technical framework was constructed and evaluated. The work is organized around four threads: feature extraction, continuous contour representation, occlusion robustness, and post-processing refinement. First, a boundary-enhanced multi-scale DSC backbone network captures curved limb features accurately and fuses semantic and edge features bidirectionally. Second, a dual-stage contour prediction mechanism combines Elliptic Fourier global shape encoding with SDF local refinement, removing the accuracy limit of discrete masks and achieving sub-pixel contour extraction. Third, a position-aware, depth-constrained data augmentation strategy improves occlusion robustness, and a level-set evolution active contour post-processing module refines contours in occluded regions. Finally, a multi-task joint weighted loss function enables collaborative optimization of all branches. Experimental results show that the proposed method performs well on both the self-built dataset and public datasets. Under severe occlusion, segmentation accuracy and contour fitting accuracy improve significantly over mainstream algorithms. The method provides reliable visual support for practical applications such as intelligent basketball game analysis and contributes to the theory of continuous contour representation for non-rigid human targets under complex dynamic occlusion.

Although the proposed method achieves good results for segmentation and contour extraction in complex basketball occlusion scenarios, it has three limitations. First, under extreme full occlusion, when key limb regions are completely hidden and no visible features remain, the constraint provided by the Elliptic Fourier shape prior weakens significantly, which reduces the plausibility and accuracy of contour completion. Second, in motion-blurred frames caused by high-speed movement, image edge features are severely weakened and the feature capture capability of DSC is limited, which degrades contour fitting accuracy. Third, in wide court views, distant athletes are small in scale and sparse in feature information, and the existing backbone network represents such small targets poorly, leaving room to improve segmentation and contour extraction accuracy. These issues indicate directions for subsequent research.

In response to these limitations, and in line with the development needs of sports vision intelligence, future work will proceed in four directions. First, inter-frame optical flow will be integrated to build a temporally continuous contour tracking and extraction framework, in which inter-frame motion correlation constrains contour evolution and improves the continuity and stability of contour extraction in long basketball video sequences. Second, a lightweight DSC structure will be designed, using techniques such as parameter pruning, quantization, and separable convolution to compress the overall model, reduce computational overhead, and enable deployment on mobile terminals and courtside edge devices. Third, human pose skeleton priors will be integrated by incorporating skeleton keypoint constraints into contour completion, strengthening the plausible reconstruction of limb contours under extreme occlusion. Fourth, the scope of the algorithm will be extended by adapting the feature extraction and contour representation modules to other multi-player competitive sports such as football and volleyball, promoting broader application of sports vision intelligence technology.

References

[1] Wong, K.W. (1992). Machine vision, robot vision, computer vision, and close-range photogrammetry. Photogrammetric Engineering and Remote Sensing, 58(8): 1197-1198.

[2] Senior, A., Hampapur, A., Tian, Y., Brown, L., Pankanti, S., Bolle, R. (2006). Appearance models for occlusion handling. Image and Vision Computing, 24(11): 1233-1243. https://doi.org/10.1016/j.imavis.2005.06.007

[3] Baerg, A. (2017). Big data, sport, and the digital divide: Theorizing how athletes might respond to big data monitoring. Journal of Sport and Social Issues, 41(1): 3-20. https://doi.org/10.1177/0193723516673409

[4] Hagara, M., Stojanović, R., Kubinec, P., Ondráček, O. (2017). Localization of moving edge with sub-pixel accuracy in 1-D images and its FPGA implementation. Microprocessors and Microsystems, 51: 1-7. https://doi.org/10.1016/j.micpro.2017.04.004

[5] Wang, Y., Zhang, N., Yan, H., Zuo, M., Liu, C. (2017). Using local edge pattern descriptors for edge detection. International Journal of Pattern Recognition and Artificial Intelligence, 32(3): 1850006. https://doi.org/10.1142/s0218001418500064

[6] Zhu, H.J., Fan, H.H., Shu, Z.Q., Yu, Q., Zhao, X.R., Gan, P.Z. (2019). Edge detection with chroma components of video frame based on local autocorrelation. IEEE Access, 7: 48543-48550. https://doi.org/10.1109/ACCESS.2019.2910605

[7] Monezi, L.A., Calderani Junior, A., Mercadante, L.A., Duarte, L.T., Misuta, M.S. (2020). A video-based framework for automatic 3D localization of multiple basketball players: A combinatorial optimization approach. Frontiers in Bioengineering and Biotechnology, 8: 286. https://doi.org/10.3389/fbioe.2020.00286

[8] Nakada, M., Zhou, T., Chen, H., Lakshmipathy, A., Terzopoulos, D. (2020). Deep learning of neuromuscular and visuomotor control of a biomimetic simulated humanoid. IEEE Robotics and Automation Letters, 5(3): 3952-3959. https://doi.org/10.1109/lra.2020.2972829

[9] Feng, Y., Liu, X. (2021). Application of video processing technology based on diffusion equation model in basketball analysis. Advances in Mathematical Physics, 2021(1): 7522973. https://doi.org/10.1155/2021/7522973

[10] Dong, X. (2021). Physical training information system of college sports based on big data mobile terminal. Mobile Information Systems, 2021(1): 4109794. https://doi.org/10.1155/2021/4109794

[11] Tan, X. (2023). Enhanced sports predictions: A comprehensive analysis of the role and performance of predictive analytics in the sports sector. Wireless Personal Communications, 132(3): 1613-1636. https://doi.org/10.1007/s11277-023-10585-z

[12] Magaz-González, A.M., García-Tascón, M., Sahelices-Pinto, C., Gallardo, A.M., Pérez, J.C.G. (2023). Technology and digital transformation for the structural reform of the sports industry: Building the roadmap. Proceedings of the Institution of Mechanical Engineers, Part P: Journal of Sports Engineering and Technology, 238(2): 150-158. https://doi.org/10.1177/17543371231197323

[13] Li, H., Huang, X. (2024). Intelligent dance motion evaluation: An evaluation method based on keyframe acquisition according to musical beat features. Sensors, 24(19): 6278. https://doi.org/10.3390/s24196278

[14] Zhou, J., Tian, L. (2024). Design of a mobile big data processing-based sports health evaluation system using graph neural network. IEEE Access, 12: 48997-49006. https://doi.org/10.1109/access.2024.3383929

[15] Wang, H., Chen, T., Wang, Y. (2025). Towards occlusion-aware multi-pedestrian tracking. Applied Sciences, 15(24): 13045. https://doi.org/10.3390/app152413045

[16] Xia, Y., Zhang, L., Guo, T., Jin, Q. (2025). Boundary-aware semantic segmentation of remote sensing images via Segformer and Snake Convolution. Computer Science and Information Systems, 22(3): 991-1010. https://doi.org/10.2298/CSIS250312054X

[17] Xin, W., Wu, Z., Zhu, Q., Bi, T., Li, B., Tian, C. (2025). Dynamic snake convolution neural network for enhanced image super-resolution. Mathematics, 13(15): 2457. https://doi.org/10.3390/math13152457

[18] Han, L., Chen, L., Dong, L. (2026). Study on the impact of digitalization and the energy consumption structure on the green development of sports industry in China. Polish Journal of Environmental Studies, 35(1): 1145-1159. https://doi.org/10.15244/pjoes/197056

[19] Ma, S., Liu, L., Cheng, M., Qin, P., Han, Z., Chen, C., Wang, H. (2026). Visibility-prior guided dual-stream mixture-of-experts for robust facial expression recognition under complex occlusions. Electronics, 15(6): 1230. https://doi.org/10.3390/electronics15061230

[20] Lv, T., Sheng, K., Qiao, L. (2026). A geometry-driven quantitative modeling framework for image-based human motion evaluation: Application to sub-pixel posture analysis and feature attribution. Mathematics, 14(5): 746. https://doi.org/10.3390/math14050746
