A Comparative Study of Two Different Point Cloud 3D Object Detection Methods

ABSTRACT


INTRODUCTION
The development of Intelligent Connected Vehicles (ICV) and Cooperative Vehicle Infrastructure Systems (CVIS) is expected to greatly improve the safety and efficiency of the traffic system [1]. Among the many related technologies, environmental perception plays a fundamental role for both ICV and CVIS. Benefiting from its strong environmental adaptability and its capability to acquire accurate distance information, LiDAR has become one of the most popular sensors for the perception task [2,3]. Unlike the camera, which is arguably the most popular sensor in this field, LiDAR provides 3D point clouds rather than 2D images. Detecting objects from 3D point clouds is therefore becoming an important research topic.
Traditional approaches to recognition in 3D point clouds generally adopt a multi-stage framework, which clusters the points and then recognizes the clusters. In contrast, many cutting-edge studies have proposed a one-stage framework, which employs deep learning to recognize objects in an end-to-end manner [4]. One-stage methods are generally more accurate, but they need a very complex classifier. Multi-stage methods generally generate proposals in the first stage and then recognize objects in the second stage. For the proposal generation stage, they mostly rely on methods such as sliding windows or anchors. For the recognition stage, pattern matching and deep learning are widely used.
Although one-stage methods have shown great accuracy, they rely heavily on the quality of the sample data and show quite different characteristics from multi-stage methods. Therefore, this paper compares the performance of the two frameworks using two algorithms, namely MSCS-Pointpillars and AF3D.
For the one-stage framework, the MSCS-Pointpillars (Multi-Scale Channel Spatial Attention Pointpillars) method is proposed, which builds on the Pointpillars method and reduces its information loss. For the multi-stage framework, a flexible multi-stage 3D point cloud object detection method, AF3D (Accurate and Flexible 3D Object Detection Algorithm), is proposed, in which the points are first clustered and the clusters are then recognized by a neural network.
The rest of the paper is organized as follows. Section 2 briefly reviews relevant research on 3D object detection, which forms the basis for our later work. Section 3 introduces the details of our methods and their verification on the KITTI 3D Object dataset. Section 4 describes the experiments on our vehicle and discusses the characteristics of the two detection frameworks based on the experimental results. Section 5 concludes the paper and proposes future work.

End-to-end framework methods
Deep learning-based methods generally adopt end-to-end frameworks, which can be divided into two categories according to the input form of the points: methods based on structured grids and methods based on raw point clouds. The first category feeds the deep learning network with some type of grid representation, while the second feeds it the raw points directly. The grid representations generally include voxels, multi-view projection images and higher-dimensional lattices.
Methods based on structured grids are generally efficient, but they inevitably lose information. Detection methods based on raw points retain more detail, but they generally consume more computational resources.

Grid-based methods
Typical grid-based methods first map the 3D point cloud into voxel space and then use 3D convolution or sparse convolution for feature extraction. The classical models include VoxelNet [5], SECOND [6], Pointpillars [7], etc. Zhou and Tuzel [5] proposed the VoxelNet algorithm, which divides the raw point cloud into voxel grids, uses VFE (Voxel Feature Encoding) layers for local feature extraction and 3D convolution to obtain global features, while an RPN (Region Proposal Network) is used to detect and locate the objects. Yan et al. [6] improved VoxelNet and proposed a spatially sparse convolutional network called SECOND, which reduced the consumption of computational resources and improved the training speed. Lang et al. [7] proposed a very efficient algorithm called Pointpillars, which divides the point cloud into a certain number of pillars and then converts them into pseudo-images [8]. However, Pointpillars compresses the raw points into a single-scale pillar division, which may still cause information loss. Some other methods project the original point cloud into an aerial view or a frontal view, so that the unordered point cloud in 3D space is converted into a regular image, to which image recognition algorithms such as YOLO and other CNNs can be applied more conveniently. The classical models of this type include YOLO3D [9], Complex-YOLO [10], etc.

Raw point cloud-based methods
Classic raw point cloud-based detection methods include Pointnet [11], Pointnet++ [12], 3D-SSD [13], Point-GNN [14], SPG [15], etc. In 2017, Qi et al. [11] proposed Pointnet, which feeds a deep learning network directly with the raw point cloud. Pointnet extracts features from the point cloud through a number of shared MLPs and performs max-pooling over each dimension of the feature map to obtain global features. Paigwar et al. [16] improved Pointnet with a visual attention mechanism. Qi et al. [12] proposed an improved version called Pointnet++, which presented a hierarchical feature learning architecture for better abstraction of multi-scale features. Yang et al. [13] proposed the 3D-SSD algorithm, which uses a feature distance-based farthest point sampling method to exclude background points while also taking semantic information into account. In 2021, Xu et al. [15] proposed the SPG method, which generates semantic point sets and then merges them with the original point cloud to obtain an enhanced point cloud. A point cloud detector is then employed to obtain the detection results.

Multi-step framework methods
Multi-step framework methods generally include a point extraction or clustering step before the recognition step. Common clustering methods fall into five categories: partition clustering, distance clustering, density clustering, hierarchical clustering and grid clustering [17]. K-means is one of the most typical partition clustering algorithms; it iteratively finds the centers of the clusters and assigns each point to its nearest center. However, K-means requires many manually adjusted parameters and performs poorly on objects with complex shapes. Recent progress on partition clustering can be found in the study [18]. Euclidean clustering is one of the most typical distance clustering algorithms; it groups neighboring points into one class. As a very typical density clustering algorithm, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) treats closely linked points as one cluster. Some recent progress in this category can be found in the studies [19,20]. In recent years, hybrid clustering algorithms, which combine several different clustering algorithms, have gradually attracted more attention. Wang et al. [21] proposed the SPSO-FCM algorithm, which combines an improved particle swarm optimization algorithm with the fuzzy C-means algorithm to cluster the points. The algorithm converges quickly and segments boundary regions clearly, and it obtains better segmentation results for scenes with complex profiles and a large number of points. More hybrid clustering algorithms can be expected in the future.
For detection, traditional methods have developed various algorithms based on the 3D geometric features of the points. Tan [22] selected the outer contour and the reflection intensity of the cluster to build a feature vector, and then employed an SVM (Support Vector Machine) to classify the cluster using that feature vector. This algorithm obtained good accuracy at the cost of efficiency. Yang [23] clustered the point cloud based on the spatial scale, reflection intensity, overall distribution and other geometrical characteristics, and then also classified the clusters with an SVM. In 2019, Shi et al. [24] migrated the image detection network RCNN to point cloud data and proposed a multi-stage 3D object detection network called Point RCNN. The authors further proposed an improved version called PV-RCNN [25], which uses efficient 3D voxel convolution and point cloud convolution to estimate the confidence and spatial location of objects. In 2021, Li et al. [26] proposed an efficient multi-stage detection network, LiDAR-RCNN. This method uses voxelization to remove background points and then builds a Pointnet-based network to extract features, classify the object and locate it.

Pointpillars
As a popular 3D point cloud detection method, the Pointpillars [7] algorithm is composed of three modules: a pillar feature extraction module, a feature extraction module and a single-shot multi-box detection module. In this work, an improved Pointpillars called MSCS-Pointpillars is proposed, which employs a multi-scale pillar module and an attention mechanism to enhance its ability to detect objects of different sizes. As shown in Figure 1, the original single-scale pillars in the pillar feature extraction module are replaced with pillars at three different scales, and the attention mechanism is applied.
The three pillar scales produce feature maps of sizes $(C, 2H, 2W)$, $(C, H, W)$ and $(C, H/2, W/2)$, respectively. By applying a convolution operation, the feature map of size $(C, 2H, 2W)$ is compressed to a smaller map of size $(C/2, H, W)$. Similarly, the feature map of size $(C, H, W)$ is transformed to size $(C/2, H, W)$, while the feature map of size $(C, H/2, W/2)$ is also transformed to $(C/2, H, W)$. The three feature maps from the three scales are then fused along the channel direction, so that the size of the fused feature map is $(3C/2, H, W)$. Finally, the fused feature map is further convolved to a smaller feature map $F$ of size $(C, H, W)$.
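The shape bookkeeping of this fusion step can be traced with a few lines of Python. This is only a sketch of the tensor sizes, not of the convolutions themselves; the function name and the example values `64, 248, 216` are assumptions made for illustration and are not taken from the paper.

```python
def multiscale_fusion_shapes(c, h, w):
    """Trace (channels, height, width) through the multi-scale fusion step."""
    if c % 2 or h % 2 or w % 2:
        raise ValueError("c, h and w are assumed to be even in this sketch")
    # Feature maps produced by the three pillar scales.
    branches = [(c, 2 * h, 2 * w), (c, h, w), (c, h // 2, w // 2)]
    # Each branch is transformed to half the channels at the common (h, w) size.
    aligned = [(c // 2, h, w) for _ in branches]
    # Concatenation along the channel direction sums the channel counts.
    fused = (sum(ch for ch, _, _ in aligned), h, w)
    # A further convolution maps the fused map to the final feature map F.
    final = (c, h, w)
    return branches, aligned, fused, final


branches, aligned, fused, final = multiscale_fusion_shapes(64, 248, 216)
print(fused)  # (96, 248, 216)
print(final)  # (64, 248, 216)
```

The fused map carries three half-width branches, which is why its channel count is 3C/2 before the last convolution restores C channels.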

Improved feature extraction based on attention module
As shown in Figure 3, a CBAM (Convolutional Block Attention Module) is introduced to improve the detection performance of the network. It makes the network pay more attention to the information that is helpful for the recognition task. First, global max-pooling and average-pooling operations are applied to the feature map $F$ obtained by the multi-scale pillar feature extraction module. Each pooling result is passed through an MLP, and the two outputs are added element-wise to generate the channel weight $M_c \in \mathbb{R}^{C \times 1 \times 1}$. $M_c$ is then multiplied with the fused feature map $F$ to obtain a channel attention feature map $F_c$. Second, $F_c$ is subjected to max-pooling and average-pooling along the channel dimension. The pooling results are concatenated and passed through a $7 \times 7$ convolution to generate the spatial weight $M_s \in \mathbb{R}^{1 \times H \times W}$. $M_s$ and $F_c$ are then multiplied to obtain a weighted feature map $F_{cs}$ carrying both channel and spatial weight information, and $F_{cs}$ is sent to the feature extraction network for higher-level representations. Specifically, the feature extraction network extracts features from the feature maps through two-dimensional convolution.
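The two attention steps can be illustrated with a small dependency-free sketch. It assumes the usual CBAM design choices that the text does not spell out: a ReLU inside the shared MLP, a reduced hidden width, and a sigmoid after both the element-wise addition and the 7*7 convolution. All weights are random placeholders and the tensor sizes are tiny, so this only shows the data flow, not the trained module.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w1, w2):
    """Two-layer MLP (ReLU in between) shared by both pooled descriptors."""
    hidden = [max(0.0, sum(a * b for a, b in zip(row, vec))) for row in w1]
    return [sum(a * b for a, b in zip(row, hidden)) for row in w2]


def channel_weights(F, w1, w2):
    """M_c: one weight per channel from global max- and average-pooling."""
    flat = [[v for row in ch for v in row] for ch in F]
    max_out = shared_mlp([max(ch) for ch in flat], w1, w2)
    avg_out = shared_mlp([sum(ch) / len(ch) for ch in flat], w1, w2)
    return [sigmoid(a + b) for a, b in zip(max_out, avg_out)]


def spatial_weights(Fc, kernel):
    """M_s: one weight per location from channel-wise pooling and a k x k conv."""
    C, H, W = len(Fc), len(Fc[0]), len(Fc[0][0])
    k = len(kernel[0])
    pad = k // 2
    pooled = [
        [[max(Fc[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)],
        [[sum(Fc[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)],
    ]
    Ms = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < H and 0 <= x < W:  # zero padding at borders
                            acc += kernel[m][di][dj] * pooled[m][y][x]
            Ms[i][j] = sigmoid(acc)
    return Ms


def cbam(F, w1, w2, kernel):
    """Channel attention followed by spatial attention on a (C, H, W) map."""
    Mc = channel_weights(F, w1, w2)
    Fc = [[[Mc[c] * v for v in row] for row in F[c]] for c in range(len(F))]
    Ms = spatial_weights(Fc, kernel)
    return [[[Ms[i][j] * v for j, v in enumerate(row)]
             for i, row in enumerate(ch)] for ch in Fc]


rng = random.Random(0)
C, H, W, hidden = 4, 6, 6, 2  # toy sizes, chosen only for the demo
F = [[[rng.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(C)]
kernel = [[[rng.uniform(-0.1, 0.1) for _ in range(7)] for _ in range(7)]
          for _ in range(2)]
F_cs = cbam(F, w1, w2, kernel)
print(len(F_cs), len(F_cs[0]), len(F_cs[0][0]))  # 4 6 6
```

Because both weights lie strictly between 0 and 1 in this sketch, the module rescales the feature map without changing its shape.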

Loss function
The output of the MSCS-Pointpillars network mainly includes the category of the object and the parameters of the 3D bounding box. The 3D bounding box can be represented by the parameters $(x, y, z, w, l, h, \theta)$, where $(x, y, z)$ represents the coordinates of the center of the 3D bounding box, $(w, l, h)$ represents its width, length and height, respectively, and $\theta$ represents the rotation angle around the $z$-axis (the $z$-axis is the axis perpendicular to the ground). The loss function is therefore designed as shown in Eq. (1), which includes the 3D bounding box regression loss, the classification loss and the orientation loss. The 3D bounding box regression loss function is shown in Eq. (2), which mainly includes the 3D box position regression loss, the size regression loss and the orientation regression loss, as shown in Eqs. (3), (4) and (5), respectively. The classification loss $\ell_{cls}$ is shown in Eq. (7), where $p^a$ represents the probability of a certain category. Following Lang et al. [7], $\alpha$ is set to 0.25 and $\gamma$ to 2. Since Eq. (5) cannot distinguish between 3D boxes flipped by 0° and 180°, $\ell_{dir}$ uses the Softmax function to learn the orientation of the 3D bounding box in discrete directions [6].
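Eqs. (5) and (7) are not reproduced in this text. As a hedged illustration, the sketch below assumes they take the forms used by Pointpillars [7]: a focal loss for classification and a smooth-L1 loss on the sine of the angle difference for the orientation regression. The function names are hypothetical. The last lines show why a sine-encoded angle term cannot tell a box from its 180° flip, which is the reason a separate direction classifier is needed.

```python
import math


def focal_loss(p, alpha=0.25, gamma=2.0):
    """Focal loss for the probability p assigned to the true class."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)


def smooth_l1(x, beta=1.0):
    """Smooth-L1 (Huber-like) penalty on a scalar residual."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def sine_angle_loss(theta_gt, theta_pred):
    """Assumed form of the orientation regression term: SmoothL1(sin(dtheta))."""
    return smooth_l1(math.sin(theta_gt - theta_pred))


# Confident correct predictions are down-weighted by the (1 - p)^gamma factor.
print(focal_loss(0.9) < focal_loss(0.5))  # True

# A prediction flipped by 180 degrees yields (numerically) zero angle loss.
print(sine_angle_loss(0.3, 0.3 + math.pi) < 1e-12)  # True
```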

AF3D
Our work combined traditional object detection algorithms with a PFC-Net network built on Pointnet, resulting in a flexible and accurate 3D object detection algorithm, AF3D. The AF3D algorithm framework is shown in Figure 4. First, the method of [27] was employed to extract and remove the ground points. After that, the non-ground point clouds were segmented and clustered using the DBSCAN algorithm.
For the DBSCAN algorithm, the clustering neighborhood radius (Eps) was set to 0.45 m, the Euclidean distance metric was adopted, and the minimum number of points required for a core point (MinPts) was set to 10. The clustering results obtained on the non-ground point clouds using DBSCAN are shown in Figure 5. The first column shows the point cloud clustering results, and the second shows the corresponding original images. It was found that all the point clouds are properly clustered.
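For readers unfamiliar with DBSCAN, a minimal brute-force version using the parameters above (Eps = 0.45 m, MinPts = 10, Euclidean distance) might look as follows. This is a textbook sketch with O(n²) neighbor search, not the implementation used in AF3D; the toy points are made up for the demonstration, and a point counts as its own neighbor, which is a common convention assumed here.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including itself)."""
    p = points[idx]
    return [j for j, q in enumerate(points)
            if math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) <= eps]


def dbscan(points, eps=0.45, min_pts=10):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels


# Two dense groups of 12 points each, 10 m apart, plus one isolated point.
points = [(0.04 * i, 0.0, 0.0) for i in range(12)]
points += [(10.0 + 0.04 * i, 0.0, 0.0) for i in range(12)]
points.append((5.0, 5.0, 0.0))
labels = dbscan(points)
print(labels[0], labels[12], labels[24])  # 0 1 -1
```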

PFC-Net network training
1) KITTI 3D Object Dataset
The KITTI 3D Object dataset [28], one of the most commonly used public datasets, was used to train PFC-Net and to verify its performance. It consists of 7481 training samples and 7518 testing samples, mainly including various types of cars, pedestrians and cyclists. The dataset covers multiple scenes such as highways, suburbs and urban roads. Each frame of data includes a point cloud and its corresponding RGB image, label file, etc. According to the truncation level and the degree of occlusion, the samples can be categorized into easy, medium and hard levels, which represent the difficulty of detection, as shown in Table 1.

2) Dataset reconstruction
The original KITTI 3D Object dataset contains point clouds of whole scenes rather than clusters, so it cannot be used directly for PFC-Net. Therefore, the KITTI 3D Object dataset was reconstructed. First, according to the label files, the point cloud data belonging to three categories, namely cars, cyclists and pedestrians, is extracted. Second, data augmentation is used to balance the number of samples in each category for a better training effect. The specific steps are as follows.
a) Object point cloud extraction
For each frame of point clouds in the KITTI training set, the label categories "Van", "Car", "Truck" and "Tram" are unified into "Car", so that only "Car", "Pedestrian" and "Cyclist" are kept. An extraction example is shown in Figure 7, and the extracted clusters of the three categories are shown in Figure 8. In addition, due to the characteristics of LiDAR, the spatial information of an object may be seriously incomplete when its points are too sparse. Therefore, point-count thresholds of 200, 50 and 50 were stipulated for the "Car", "Pedestrian" and "Cyclist" categories, respectively.
b) Data augmentation
Data augmentation was applied to the categories with a small proportion in the training set. The samples in these categories were translated and randomly rotated around the $z$-axis of the LiDAR coordinate system. In addition, the Weighted Random Sampler method, which selects training data based on the weight of each category, was used to address the imbalanced sample proportions. The proportions of samples after data augmentation and weighted random sampling are shown in Table 2.
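The reconstruction steps can be sketched as below. Several details are assumptions made for this illustration only: the thresholds are treated as minimum point counts, the translation is limited to the ground plane with a hypothetical range `max_shift`, and the rotation angle is drawn uniformly over a full turn. The helper names are not from the paper.

```python
import math
import random

# KITTI labels unified into the three categories kept for PFC-Net.
LABEL_MAP = {"Van": "Car", "Car": "Car", "Truck": "Car", "Tram": "Car",
             "Pedestrian": "Pedestrian", "Cyclist": "Cyclist"}
# Point-count thresholds per category (assumed to be minimum counts).
MIN_POINTS = {"Car": 200, "Pedestrian": 50, "Cyclist": 50}


def unify_label(kitti_label):
    """Map a raw KITTI label to Car/Pedestrian/Cyclist, or None if dropped."""
    return LABEL_MAP.get(kitti_label)


def keep_object(label, points):
    """Keep an extracted object only if it has enough points for its category."""
    return label in MIN_POINTS and len(points) >= MIN_POINTS[label]


def augment(points, max_shift=0.5, rng=random):
    """Rotate a cluster about the z-axis and translate it in the x-y plane."""
    angle = rng.uniform(-math.pi, math.pi)
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(cos_a * x - sin_a * y + dx, sin_a * x + cos_a * y + dy, z)
            for x, y, z in points]


cluster = [(1.0, 0.0, 0.2), (1.5, 0.5, 1.1), (0.8, -0.3, 1.7)]
augmented = augment(cluster, rng=random.Random(0))
print(unify_label("Van"), unify_label("Misc"))  # Car None
```

A rotation about the vertical axis keeps every point's height and all pairwise distances, so the augmented sample has the same shape as the original cluster.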
According to the study [29], only objects located within 40 m ahead can be properly detected when the speed of the car reaches 80 km/h. Therefore, only obstacles located within that range are classified by the algorithms in the following verification. If the number of input points is less than 1024, the input is padded with zeros to 1024 points. If the number of input points is greater than 1024, the input is downsampled to 1024 points using farthest point sampling.
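The fixed-size input rule can be sketched as follows. The starting index of the farthest point sampling is not stated in the text, so starting from the first point is an assumption of this sketch, as are the function names.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k points, each farthest from those already selected."""
    selected = [start]
    # min_d2[i]: squared distance from point i to its nearest selected point.
    min_d2 = [sq_dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_d2[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            d2 = sq_dist(p, points[nxt])
            if d2 < min_d2[i]:
                min_d2[i] = d2
    return [points[i] for i in selected]


def to_fixed_size(points, n=1024):
    """Zero-pad small clusters and downsample large ones to exactly n points."""
    if len(points) > n:
        return farthest_point_sampling(points, n)
    return list(points) + [(0.0, 0.0, 0.0)] * (n - len(points))


line = [(float(i), 0.0, 0.0) for i in range(5)]
print(farthest_point_sampling(line, 3))
# [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(len(to_fixed_size(line)))  # 1024
```

On the toy line of five points, the sampler picks the two ends first and then the middle, which illustrates how farthest point sampling preserves the overall extent of a cluster.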
The reconstructed KITTI 3D Object dataset is divided into a training set and a test set at a ratio of 8:2. The cross-entropy loss function shown in Eq. (8) was adopted for PFC-Net,
where $M$ represents the number of categories, $i$ is the sample index, $c$ is the category index, and $y_{ic}$ is an indicator that equals 1 only if category $c$ is consistent with the label of sample $i$ and equals 0 otherwise. $p_{ic}$ represents the predicted confidence that sample $i$ belongs to category $c$.
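Eq. (8) itself is not reproduced in this text. A standard multi-class cross-entropy that is consistent with the definitions of the number of categories, the sample index, the category index, the label indicator and the predicted confidence would read as follows. Writing these as $M$, $i$, $c$, $y_{ic}$ and $p_{ic}$, and averaging over $N$ samples, are notational assumptions made here.

```latex
L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_{ic} \log\left(p_{ic}\right)
```

Because $y_{ic}$ is nonzero only for the labelled category, each sample contributes the negative log-confidence of its true class.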
The Adam (Adaptive Moment Estimation) optimizer is adopted with an initial learning rate of $1 \times 10^{-3}$, and the learning rate is multiplied by 0.8 every 15 epochs. The number of epochs is set to 150 and the batch size is set to 128.
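Read as a step decay, this schedule can be written in one line. The sketch assumes that the decay is applied at the start of every 15th epoch, with epochs counted from 0; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.8, step=15):
    """Step decay: multiply the learning rate by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)


for epoch in (0, 14, 15, 149):
    print(epoch, learning_rate(epoch))
```

Under this reading, the rate is decayed nine times over 150 epochs, ending at about 1.34 * 10^-4.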
For MSCS-Pointpillars, the maximum number of pillars, denoted as $P$, is set to 12000, and the maximum number of points in each pillar is set to 100.
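How the two limits act during pillar construction can be sketched as follows. The pillar edge length `pillar_size=0.16` is an assumption borrowed from the original Pointpillars configuration and is not stated here; points beyond either limit are simply dropped in arrival order, whereas a real implementation may sample randomly instead.

```python
def pillarize(points, pillar_size=0.16, max_pillars=12000, max_points=100):
    """Group points into vertical pillars on an x-y grid, enforcing both limits."""
    pillars = {}
    for x, y, z in points:
        key = (int(x // pillar_size), int(y // pillar_size))
        if key not in pillars:
            if len(pillars) >= max_pillars:
                continue  # pillar budget exhausted: drop the point
            pillars[key] = []
        if len(pillars[key]) < max_points:
            pillars[key].append((x, y, z))
    return pillars


pts = [(0.01, 0.01, 0.0), (0.02, 0.03, 1.0), (0.5, 0.5, 0.0)]
pillars = pillarize(pts)
print(sorted(pillars))       # [(0, 0), (3, 3)]
print(len(pillars[(0, 0)]))  # 2
```

Calling the same function with three different values of `pillar_size` would give the three pillar scales used by the multi-scale module.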
For better comparison, all the samples were classified into easy, medium and hard levels, each with its own overlap threshold. For cars, the IoU overlap thresholds for the easy, medium and hard levels are all set to 0.7. For cyclists and pedestrians, the IoU overlap thresholds are all set to 0.5. The training loss curve is shown in Figure 11. In the legend of the figure, "loss_cls" represents the classification loss, "loss_bbox" represents the 3D box loss, and "loss_dir" represents the orientation loss.
Our work uses the average precision as the criterion. The verification results on the KITTI 3D Object dataset are shown in Table 3, where the performance is reported by difficulty level. It can be seen that, compared to Pointpillars, MSCS-Pointpillars improves on almost all categories.
For the car category, the maximum height threshold is set to 2.5 m, the maximum length threshold is set to 8 m, and the minimum height threshold is 1.5 m. Due to the severe occlusion of some cars in the dataset, no minimum length threshold is specified for cars. When the height or length of a cluster is greater than the maximum height or length threshold of the car category, or when the height of the cluster is less than the minimum height threshold, the cluster can be discarded. For the cyclist category, the maximum height threshold is set to 2 m, the maximum length threshold is 2.5 m, and the minimum height threshold is 1.25 m. For the pedestrian category, the maximum height threshold is set to 2 m and the minimum height threshold is 1.25 m.
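The size rules can be collected into a small lookup table. The sketch below assumes that a cluster is rejected as soon as any single limit is violated, which is one reading of the rule above, and that the cluster's height and length have already been measured. The names are hypothetical.

```python
# Size limits in metres; None means that no limit is specified.
SIZE_LIMITS = {
    "Car": {"max_h": 2.5, "max_l": 8.0, "min_h": 1.5},
    "Cyclist": {"max_h": 2.0, "max_l": 2.5, "min_h": 1.25},
    "Pedestrian": {"max_h": 2.0, "max_l": None, "min_h": 1.25},
}


def size_is_plausible(category, height, length):
    """Return False if the cluster violates any size limit of the category."""
    lim = SIZE_LIMITS[category]
    if height > lim["max_h"] or height < lim["min_h"]:
        return False
    if lim["max_l"] is not None and length > lim["max_l"]:
        return False
    return True


print(size_is_plausible("Car", 1.6, 4.5))      # True
print(size_is_plausible("Car", 1.0, 4.5))      # False
print(size_is_plausible("Cyclist", 1.7, 3.0))  # False
```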
Similarly, the verification results of AF3D on the KITTI 3D Object dataset are shown in Table 4.
The visualization results of AF3D on KITTI 3D Object dataset are shown in Figure 12.

EXPERIMENT
In this section, both algorithms were run on our own vehicle to conduct an experimental comparison between MSCS-Pointpillars and AF3D, and the advantages and disadvantages of the two methods were then analyzed comprehensively.

Experiment platform
The hardware equipment on the car includes a HESAI Technology Pandar 64-line LiDAR, a monocular camera, an RTK GPS and a laptop, as shown in Figure 13.

Experiment data
A driver drove the experimental car along a road on our campus. Part of the scene is shown in Figure 14. Because the speed of the car was relatively low, a frame of point cloud data was captured every 2 s. In total, our experiment dataset includes 200 frames of point cloud data. 3D bounding boxes are employed to label cars, cyclists and pedestrians.
Figure 15 shows some labeling results, where red represents cars, green represents pedestrians, and purple represents cyclists.
After being trained on the KITTI dataset, MSCS-Pointpillars and AF3D were tested on our experiment dataset without retraining or refining, and the detection accuracy is shown in Table 5. From the verification results on the KITTI 3D Object dataset in Sections 3.3.1 and 3.3.2, it can be seen that both methods have good detection capability. The detection accuracy is highest for cars, followed by cyclists, while pedestrians obtain the lowest detection accuracy. One reason is that pedestrians generally produce the fewest points among the three categories at a similar distance.
It can also be seen that the detection accuracy of MSCS-Pointpillars on the KITTI 3D Object dataset is higher than that of AF3D. This is because MSCS-Pointpillars can make use of the context information of the whole point cloud, while AF3D can only use the information within each cluster for recognition.
The training time and the computation cost of the AF3D algorithm are much lower than those of MSCS-Pointpillars. This is because in AF3D only PFC-Net, which is much simpler than MSCS-Pointpillars, needs to be trained, and the DBSCAN step in AF3D is highly efficient.
Because neither algorithm was retrained or refined, the scenes in our experiment are unfamiliar to both MSCS-Pointpillars and AF3D. The detection accuracy of MSCS-Pointpillars decreased significantly, whereas the detection accuracy of AF3D suffered only a small decrease. In our experiment, the detection accuracy of AF3D on cars, pedestrians and cyclists is 15.35%, 8.37% and 15.1% higher than that of MSCS-Pointpillars, respectively. This is because AF3D employs a traditional clustering algorithm to cluster the point cloud before recognizing it. Moreover, the clusters of each category change only slightly between scenes, which makes them much easier to recognize in an unfamiliar scenario. In contrast, MSCS-Pointpillars relies heavily on the context information of the whole scene for its accuracy, and this context information changes greatly in unfamiliar scenes.

CONCLUSION
This paper has proposed two different approaches to the object detection task in 3D point clouds. MSCS-Pointpillars is a deep learning algorithm that adopts an end-to-end framework to fully utilize context information, resulting in high accuracy. AF3D combines deep learning with the traditional two-step framework. The comparison of these two algorithms shows that the traditional two-step framework helps to reduce the demand for computational resources and enhances the adaptability to unfamiliar scenes. The end-to-end framework offers better detection performance, but it requires much more computational resources and relies heavily on the generality of the training sample set. On the other hand, the accuracy of the two-step framework relies heavily on the clustering accuracy, while high recognition accuracy for each cluster is comparatively easy to achieve.

Figure 1. Overview of the MSCS-Pointpillars network
where $N_{pos}$ represents the number of positive sample boxes, $\ell_{loc}$ represents the 3D box regression loss function, $\ell_{cls}$ represents the classification loss function, and $\ell_{dir}$ represents the object orientation loss function. $\beta_{loc}$, $\beta_{cls}$ and $\beta_{dir}$ are the corresponding weights. In this paper, the loss function parameters of the Pointpillars network are used for reference, and the weights are set to $\beta_{loc} = 2$, $\beta_{cls} = 2$ and $\beta_{dir} = 2$.

3.2.2 Stage 2: Point cloud classification network PFC-Net
Because the point clouds have already been clustered, a very simple classification network named PFC-Net was designed to classify each cluster. The network structure of PFC-Net is shown in Figure 6. Each cluster is input as a tensor of size $(N, 3)$, where $N$ represents the number of input points and 3 represents the channel information of each point. Feature extraction is performed by multiple weight-shared MLP modules, yielding a feature map $F_1$ represented by a tensor of size $(N, 1024)$. A global feature map of size $(1, 1024)$ is then obtained by a max-pooling layer. By concatenating the global feature with the feature map $F_1$, a feature map $F_2$, represented by a tensor of size $(N, 2048)$, is obtained. $F_2$ contains both the detailed features and the global features of the cluster. $F_2$ is then passed through two weight-shared MLPs and a max-pooling layer for further feature extraction, forming a new feature map $F_3$ represented by a tensor of size $(1, 1024)$. Finally, $F_3$ is sent to three fully connected layers to produce the classification result.
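The data flow up to the feature map that feeds the fully connected layers can be traced with a small sketch. It replaces every MLP module with a single linear layer plus ReLU and uses a feature width of 8 instead of 1024, both purely to keep the example small; the weights are random placeholders. The sketch also shows that, because the per-point layers are shared and the pooling is a maximum, the resulting feature does not depend on the order of the input points.

```python
import random


def shared_layer(rows, weights):
    """Apply the same linear layer + ReLU to every row (point) independently."""
    return [[max(0.0, sum(w * v for w, v in zip(w_row, row))) for w_row in weights]
            for row in rows]


def max_pool(rows):
    """Column-wise maximum over all points: one global feature vector."""
    return [max(col) for col in zip(*rows)]


def pfc_features(points, w1, w2):
    """(N, 3) -> F1 (N, D) -> global (D,) -> F2 (N, 2D) -> F3 (D,)."""
    f1 = shared_layer(points, w1)
    global_feat = max_pool(f1)
    f2 = [row + global_feat for row in f1]  # concatenate local and global
    return max_pool(shared_layer(f2, w2))


rng = random.Random(0)
D = 8  # stands in for the width of 1024 used by PFC-Net
w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(D)]
w2 = [[rng.uniform(-1, 1) for _ in range(2 * D)] for _ in range(D)]
cloud = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(0, 2))
         for _ in range(5)]
shuffled = cloud[:]
rng.shuffle(shuffled)

f3 = pfc_features(cloud, w1, w2)
print(len(f3))                                # 8
print(f3 == pfc_features(shuffled, w1, w2))   # True
```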

Figure 5. Point cloud clustering results based on the DBSCAN algorithm
Figure 7 (a) and (b) show an extraction example in which a pedestrian, denoted by green points, has been extracted from the original point cloud. The visualization results of the extracted clusters belonging to the three categories are shown in Figure 8 (a), (b) and (c), respectively.

3) Verification of PFC-Net
The verification was run on a workstation equipped with an Intel i7-10700KF CPU and an NVIDIA RTX 3080 graphics card. The software environment includes Ubuntu 18.04 LTS, CUDA 11.1, cuDNN 8.0.5 and Python 3.7.

Figure 9. PFC-Net training loss curve: the vertical axis represents the loss function value, and the horizontal axis represents the number of training epochs
The training loss curve of PFC-Net is shown in Figure 9, where the vertical axis represents the loss value and the horizontal axis represents the number of training epochs. It shows that the training has mostly converged after the 120th epoch. The average accuracy curve is shown in Figure 10, where the vertical axis represents the overall accuracy and the horizontal axis represents the number of training epochs. The red curve represents the classification accuracy on the training set, while the blue curve represents the classification accuracy on the test set. After the network is well trained, the average point cloud classification accuracy reaches 99.21% on the training set and 95.78% on the test set.

Figure 10. PFC-Net classification average accuracy curve: the vertical axis represents the overall accuracy, and the horizontal axis represents the number of training epochs
3.3 Verification of two methods on KITTI
3.3.1 Verification of MSCS-Pointpillars
The performance of MSCS-Pointpillars was also verified on the KITTI 3D Object dataset. Following the practice of Chen et al. [30], the 7481 samples were re-divided into a training set of 3712 samples and a test set of 3769 samples. In the training of MSCS-Pointpillars, the maximum number of epochs is set to 160, the initial learning rate of the Adam optimizer is set to $2 \times 10^{-4}$, and the learning rate is multiplied by 0.8 every 15 epochs. The region of interest is cropped by pass-through filtering, and the values are shown in Eq. (9).

Figure 11. MSCS-Pointpillars training loss curve: the vertical axis represents the loss value, and the horizontal axis represents the number of epochs

Figure 12. Visualization results of AF3D on the KITTI 3D Object dataset: the first row shows the extraction of ground points (ground points are shown in green, non-ground points in gray); the second row visualizes the clustering results; the third row shows the classification of the clusters (cars are shown in red, cyclists in blue, and pedestrians in green); the fourth row shows the original RGB images corresponding to the point clouds

Figure 16. Visualization of the detection results of AF3D and MSCS-Pointpillars in the experiment. The first column shows the detection results of AF3D, the second column shows the detection results of MSCS-Pointpillars, and the third column shows the labels. Red points represent detected targets, while black points represent non-target point clouds
$x_{gt}, y_{gt}, z_{gt}, w_{gt}, l_{gt}, h_{gt}, \theta_{gt}$ represent the ground-truth values, and $x_p, y_p, z_p, w_p, l_p, h_p, \theta_p$ represent the values predicted by the network.

Table 1. KITTI dataset object detection level division

Table 2. The proportion of each category before and after data augmentation (%)

Table 3. Comparison with Pointpillars on the KITTI 3D Object test set (%)