CLOCS FOR SENSOR FUSION IN AUTONOMOUS DRIVING
CLOCS → (Camera-LiDAR Object Candidates)
CLOCS fusion is a low-complexity multi-modal fusion method that significantly improves the performance of single-modality detectors.
It fuses a 2D detector with a 3D detector and finally outputs refined confidence scores for the 3D detections. Here the 2D detections come from the CAMERA and the 3D detections come from the LIDAR.
GitHub Repository for CLOCS : https://github.com/pangsu0613/CLOCs
CLOCS Paper Link : https://arxiv.org/pdf/2009.00784.pdf
Evaluation is done on the challenging KITTI object detection benchmark. The KITTI dataset consists of 7,481 training images for which ground truth is available and 7,518 testing images for which ground truth is not available.
For CLOCS:
2D detections from the camera come from one of these architectures (generally):
- CASCADE-RCNN
- RRC
- MS-CNN
and the 3D detections come from one of these architectures (generally):
- SECOND
- PointPillars
- PV-RCNN
CLOCS fusion of CASCADE-RCNN and PV-RCNN is ranked number 4 on the KITTI 3D object detection leader-board and number 1 on the 2D detection leader-board.
The fusion of CASCADE-RCNN and SECOND is called CLOCS_SecCas.
CLOCS FUSION NETWORK ARCHITECTURE.
- We have the 2D detections and 3D detections from the camera detection model and the LiDAR detection model. We want to convert them into a set of joint detections (in layman terms, we want to mix them), which results in a sparse tensor (the blue box in the figure) from which the predicted scores are computed.
- As indicated by the red arrow, after joining we apply 2D convolutions. These process only the non-empty detections (meaning the cells that are actually filled in the sparse tensor), represented by the blue box in the image. (How this blue box is filled is covered later.)
- After performing the 2D convolution operation, the result is mapped to the desired learning targets, i.e. a probability score for each detection (after max pooling).
- As you can see above, before max pooling we use the indices of the non-empty elements, which means we only care about detection pairs whose IoU between the 2D and 3D box is greater than the threshold.
- One more thing to understand here: in the sparse input tensor T (represented by the blue box), the element at index (i, j) is filled only if the IoU between the i-th 2D detection and the j-th 3D detection is non-zero.
- So if we have 2 (2D detections) and 3 (3D detections), the sparse tensor has 2 x 3 = 6 cells, and its shape is 2 x 3 x 4 (6 x 4 when flattened).
- There is one important point that is often ignored: if we have a 3D detection but no corresponding 2D detection, we still keep that detection when encoding the sparse tensor, setting both the IoU and the 2D confidence score to -1.
- → For any architecture, the main things to look at are the input and the output. Here the input is the 2D detections and the 3D detections, and the output is the confidence score of each 3D detection.
- For k 2D detection candidates in one image: P^2D = {p^2D_1, p^2D_2, …, p^2D_k}, where p^2D_i = {[x_i1, y_i1, x_i2, y_i2], s^2D_i}
- For n 3D detection candidates in one LiDAR scan: P^3D = {p^3D_1, p^3D_2, …, p^3D_n}, where p^3D_i = {[h_i, w_i, l_i, x_i, y_i, z_i, θ_i], s^3D_i}
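To make the notation concrete, here is a minimal sketch (my own illustration, not code from the CLOCs repository; the array layouts are assumptions) of how the k 2D candidates, the n 3D candidates and the joint tensor T could be held in memory:

```python
import numpy as np

# Hypothetical in-memory layout for the detection candidates defined above.
k, n = 2, 3  # e.g. 2 camera detections and 3 LiDAR detections

# P^2D: each row is [x_i1, y_i1, x_i2, y_i2] in pixels, plus a score s^2D_i
boxes_2d  = np.zeros((k, 4), dtype=np.float32)
scores_2d = np.zeros((k,),   dtype=np.float32)

# P^3D: each row is [h_i, w_i, l_i, x_i, y_i, z_i, theta_i], plus a score s^3D_j
boxes_3d  = np.zeros((n, 7), dtype=np.float32)
scores_3d = np.zeros((n,),   dtype=np.float32)

# Joint sparse input tensor T described below: one 4-channel cell per (i, j) pair,
# so with k = 2 and n = 3 it has 2 x 3 = 6 cells and shape 2 x 3 x 4.
T = np.zeros((k, n, 4), dtype=np.float32)
print(T.shape)  # (2, 3, 4)
```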
Detailed Explanation of the Architecture :-
- Sparse Input Tensor Representation:
- T_i,j = {IoU_i,j, s^2D_i, s^3D_j, d_j}
- where IoU_i,j is the IoU between the i-th 2D detection and the j-th projected 3D detection, s^2D_i and s^3D_j are the confidence scores of the i-th 2D detection and the j-th 3D detection respectively, and d_j is the normalized distance between the j-th 3D bounding box and the LiDAR in the xy plane (a sketch of how T is filled appears after this list).
- Elements T_i,j with zero IoU are eliminated as they are geometrically inconsistent.
- We want to encode all 2D and 3D detections into joint detections.
- The output of a 2D object detector is a set of 2D bounding boxes in the image plane with corresponding confidence scores.
- The output of a 3D object detector is a set of bounding boxes in LiDAR coordinates with confidence scores.
- In the KITTI dataset a 3D bounding box is a 7-element vector. We do not perform NMS in this algorithm, because correct detections can get suppressed due to the limited information available from a single sensor modality.
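Below is a minimal sketch of how the sparse input tensor T_i,j = {IoU_i,j, s^2D_i, s^3D_j, d_j} could be filled. This is my own illustration of the description above, not the official implementation: the projection of 3D boxes to image-plane boxes is assumed to be done elsewhere, and the choice of row for 3D detections with no 2D match is arbitrary here.

```python
import numpy as np

def iou_2d(a, b):
    """Standard IoU between two axis-aligned [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def build_input_tensor(boxes_2d, scores_2d, boxes_3d_img, scores_3d, dist_xy_norm):
    """Build the k x n x 4 tensor T and the mask of non-empty elements.

    boxes_2d     : (k, 4) 2D detections [x1, y1, x2, y2] from the camera detector
    boxes_3d_img : (n, 4) image-plane boxes enclosing the projected 3D detections
    dist_xy_norm : (n,)   normalized xy distance of each 3D box from the LiDAR
    """
    k, n = len(boxes_2d), len(boxes_3d_img)
    T = np.zeros((k, n, 4), dtype=np.float32)
    mask = np.zeros((k, n), dtype=bool)
    for i in range(k):
        for j in range(n):
            iou = iou_2d(boxes_2d[i], boxes_3d_img[j])
            if iou > 0:  # zero-IoU pairs are geometrically inconsistent and stay empty
                T[i, j] = (iou, scores_2d[i], scores_3d[j], dist_xy_norm[j])
                mask[i, j] = True
    # A 3D detection with no overlapping 2D detection is NOT discarded:
    # encode its IoU and 2D score as -1 so the network can still score it.
    for j in range(n):
        if k > 0 and not mask[:, j].any():
            T[0, j] = (-1.0, -1.0, scores_3d[j], dist_xy_norm[j])
            mask[0, j] = True
    return T, mask
```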
3D detection using Multi-Modal Fusion.
- Why go for CLOCS, which finally gives the 3D bboxes, when we already have architectures like AVOD and MV3D ??
- Because the performance of those methods is worse than LiDAR-only methods. First, transforming the raw point cloud to a bird's-eye-view image loses spatial information. Second, the crop-and-resize operation used to fuse the feature vectors may destroy the feature structure from each sensor.
- Camera images are high-resolution dense data, while a LiDAR point cloud is low-resolution sparse data. Fusing two different data structures is not trivial.
- Forcing the feature vectors from 2D images and 3D point clouds to have the same size and equal length, and then concatenating, aggregating and averaging them, leads to inaccuracy.
2. MMF (Multi-Task Multi-Sensor Fusion) uses continuous convolution to build dense BEV feature maps and performs point-wise feature fusion with those dense feature maps.
OUR AIM : Take 2D detections and 3D detections, and finally output the 3D detections together with predicted scores for those 3D detections.
We fuse the detection candidates. There are 3 general categories of fusion.
(1) Early Fusion :- Has the advantage of cross-modal interaction, but data differences between modalities such as alignment, representation and sparsity are not properly addressed before the data is sent into the network.
(2) Deep Fusion :- Keeps separate channels for the different modalities while combining the features during processing. The approach is complex.
(3) Late Fusion :- Has the advantage that each detector is trained on its own sensor data; only the fusion step requires jointly aligned data. In late fusion we already have the detections from both camera and LiDAR (taking our sensor fusion as the example). It is better to avoid the NMS stage, which can suppress true detections, and to keep the score thresholds as low as possible.
CLOCS FUSION :-
- For a given image and LiDAR scan we take the detections and their scores. Fusing requires association, for which we apply geometric consistency and semantic consistency.
- Geometric Consistency : If an object is correctly detected by both the 2D and the 3D detector, the two bounding boxes in the image plane should be nearly identical. Small pose errors can reduce the overlap, so we compute the IoU between the 2D detection and the projected corners of the 3D detection (a rough sketch of this projection follows below).
- Semantic Consistency : Detectors may output multiple categories of objects, but we only associate detections of the same category.
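A rough sketch of the geometric-consistency check: project the 8 corners of a 3D box into the image using the camera projection matrix (e.g. KITTI's P2 for the left color camera), take the tight axis-aligned box around the projected corners, and compute its IoU with the 2D detection. The corner convention below follows the standard KITTI devkit (camera coordinates, box origin on the bottom face, rotation about the vertical axis); boxes behind the camera are ignored for simplicity.

```python
import numpy as np

def project_3d_box_to_image(box_3d, P):
    """Project a KITTI-style 3D box [h, w, l, x, y, z, theta] (camera coordinates)
    into the image and return the enclosing 2D box [x1, y1, x2, y2]."""
    h, w, l, x, y, z, theta = box_3d
    # 8 corners in the box's local frame (y points down, origin on the bottom face)
    xc = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
    yc = np.array([ 0.0,  0.0,  0.0,  0.0,   -h,   -h,   -h,   -h])
    zc = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
    # rotate about the vertical (y) axis, then translate to the box position
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    corners = R @ np.vstack([xc, yc, zc]) + np.array([[x], [y], [z]])
    # project with the 3x4 camera matrix P and normalize by depth
    pts = P @ np.vstack([corners, np.ones((1, 8))])
    u, v = pts[0] / pts[2], pts[1] / pts[2]
    return np.array([u.min(), v.min(), u.max(), v.max()])
```

The IoU between this projected box and each camera detection is what fills IoU_i,j in the tensor T; semantic consistency simply means the pair (i, j) is only formed when both detections belong to the same class (e.g. car with car).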
Network Architecture :-
- We take the 2D detections and the 3D detections as input and represent them in the form of a sparse tensor denoted by T. We have 2D detections in the image plane with their confidence scores, and 3D detections in LiDAR coordinates with their confidence scores.
- In the KITTI dataset a 3D detection is encoded as a 7-element vector with the 3D dimensions, the 3D location and the rotation (yaw angle).
- This vector consists of [height, width, length, x, y, z, yaw angle].
- For k 2D detections and n 3D detections, we build a k x n x 4 tensor T. The 4 channels are [IoU of the pair, 2D confidence score, 3D confidence score, normalized distance between the j-th 3D detection and the LiDAR in the xy plane].
- What we finally get from the CLOCS architecture are the confidence scores for the LiDAR 3D detections.
- If we have a 3D detection but no corresponding 2D detection, we set the IoU and the 2D confidence score to -1, but we do not discard that 3D detection.
- However, we ignore a 2D detection if it has no corresponding 3D detection with IoU greater than the threshold.
- Elements Tᵢⱼ with zero IoU are eliminated as they are geometrically inconsistent.
- Only the non-empty values are used by the fusion network. The tensor is passed through 2D convolutions, and the non-empty indices are kept in a cache.
- Based on these non-empty indices we reconstruct the tensor T, then perform max pooling and squeezing, which finally gives the predicted scores for the 3D detections (see the sketch after this list).
- So the final output of CLOCS is just the predicted score for each 3D detection.
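A minimal PyTorch sketch of the fusion head described in this list: 1x1 2D convolutions over the 4-channel tensor, followed by a max over the 2D-detection axis so each of the n 3D detections gets a single refined score. The channel widths (4 → 18 → 36 → 36 → 1) are my reading of the paper and may differ from the official repository; for simplicity this sketch runs the convolutions over the full k x n grid and masks the empty cells afterwards, whereas the real implementation gathers only the non-empty indices, processes that small dense tensor and scatters the results back before max pooling, which is what keeps it fast.

```python
import torch
import torch.nn as nn

class ClocsFusionHead(nn.Module):
    """Sketch of the CLOCs fusion network (not the official implementation)."""

    def __init__(self):
        super().__init__()
        # 1x1 convolutions: each (i, j) cell is processed independently
        self.convs = nn.Sequential(
            nn.Conv2d(4, 18, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(18, 36, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(36, 36, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(36, 1, kernel_size=1),
        )

    def forward(self, T, mask):
        # T    : (k, n, 4) joint detection tensor
        # mask : (k, n) boolean tensor, True for non-empty elements
        x = T.permute(2, 0, 1).unsqueeze(0)       # (1, 4, k, n)
        x = self.convs(x).squeeze(0).squeeze(0)   # (k, n) raw scores
        x = x.masked_fill(~mask, float('-inf'))   # empty cells never win the max
        scores, _ = x.max(dim=0)                  # max pool over the k 2D detections
        return torch.sigmoid(scores)              # refined confidence per 3D detection

# Usage sketch: T and mask come from the tensor-building step shown earlier.
# head = ClocsFusionHead()
# fused_scores = head(torch.from_numpy(T), torch.from_numpy(mask))
```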
Loss :
Uses categorical cross-entropy for the multi-class classification of the objects, modified by the focal loss to overcome the limitation of class imbalance.
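A sketch of that loss on the fused scores, written here in its binary form (each 3D detection is a positive if it matches a ground-truth box, otherwise a negative) with the usual focal-loss defaults alpha = 0.25 and gamma = 2; the exact formulation and hyper-parameters in the official code may differ.

```python
import torch

def focal_cross_entropy(pred_scores, targets, alpha=0.25, gamma=2.0):
    """Cross entropy on the fused confidence scores, modulated by the focal term
    so that the many easy negatives do not dominate the loss.

    pred_scores : (n,) sigmoid outputs of the fusion head
    targets     : (n,) 1.0 for 3D detections that match a ground-truth box, else 0.0
    """
    eps = 1e-6
    p = pred_scores.clamp(eps, 1.0 - eps)
    ce = -(targets * torch.log(p) + (1 - targets) * torch.log(1 - p))  # plain cross entropy
    p_t = targets * p + (1 - targets) * (1 - p)                        # probability of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()                  # focal modulation
```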
KITTI LEADER BOARD
Performance comparison of object detection with state-of-the-art camera-LiDAR fusion methods on the car class of the KITTI test set (new 40-recall-position metric). CLOCs fusion improves its baselines and outperforms other state-of-the-art fusion-based detectors. 3D and bird's-eye-view detection are evaluated by Average Precision (AP).
Key Points to be noted :
- What contributions does CLOCS make that differ from the current state-of-the-art methods ??
- CLOCS can use any pair of pre-trained 2D and 3D detectors without re-training them, and can be employed directly.
- It automatically learns the geometric and semantic consistencies between 2D and 3D detections.
- It is fast, adding less than 3 ms of latency on a desktop-level GPU.
- It achieved the highest rank among all fusion methods on the KITTI leader-board.
2. Why are we not diving deeper into pure 3D detection, and why are we fusing the 2D and 3D detections ??
- Because for point-cloud-only techniques, object detection performance at longer distances is relatively poor.
3. What is the difference between the SECOND architecture and the POINT-PILLARS architecture used for 3D detection ??
- SECOND uses sparse 3D CNNs, since its voxel encoding is sparse, whereas POINT-PILLARS encodes the point cloud into vertical columns (pillars) and uses 2D CNNs to perform 3D object detection.
In case of any further queries, please leave a comment here or directly ping machinelearningdeeplearning628@gmail.com.
References : https://arxiv.org/pdf/2009.00784.pdf