AVOD (Aggregate View Object Detection) IN SENSOR FUSION
AVOD fuses LiDAR point clouds and RGB images: both modalities are used by the region proposal network and by the second-stage detector network.
Region proposal network :-
- Multimodal fusion is performed on high-resolution feature maps, which are then used by the second-stage detector for 3D bounding box regression and category classification.
- The proposed architecture achieved state-of-the-art results on the KITTI dataset.
What is the advantage of 3D object detection over 2D object detection?
By extending the prediction to 3D, we can capture the size, position, and orientation of an object in the real world.
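To make this concrete, here is a minimal sketch of the two box parameterizations (the field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Box2D:
    """Axis-aligned box in image pixels: position and size only."""
    x: float
    y: float
    w: float
    h: float

@dataclass
class Box3D:
    """Box in real-world metres: position, size, AND orientation."""
    x: float    # centroid in the sensor frame
    y: float
    z: float
    l: float    # physical length
    w: float    # physical width
    h: float    # physical height
    yaw: float  # heading angle around the vertical axis
```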
What AVOD introduces compared to previous architectures :-
- The AVOD architecture is inspired by Feature Pyramid Networks. It proposes a novel feature extractor that produces full-resolution feature maps from the LiDAR point cloud and the RGB image.
- AVOD uses a feature-fusion region proposal network over multiple modalities, which produces higher recall for the less dominant classes such as pedestrians and cyclists.
- It uses 1×1 convolutions at the RPN stage, which reduces computation and increases speed (see the sketch after this list).
- It performs a DEEP FUSION scheme that combines information from the feature crops of both views to produce the detection outputs.
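A minimal sketch of the 1×1-convolution trick, with made-up channel sizes: collapsing the channel dimension of the full-resolution feature map before the per-anchor crop step is where the savings in computation and memory come from.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a 1x1 conv maps the decoder's wide feature map
# down to a thin one before tens of thousands of anchor crops are taken.
full_res_features = torch.randn(1, 256, 700, 800)   # N x C x H x W
bottleneck = nn.Conv2d(in_channels=256, out_channels=32, kernel_size=1)
thin_features = bottleneck(full_res_features)
print(thin_features.shape)  # torch.Size([1, 32, 700, 800])
```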
Before AVOD was introduced, previous methods like MV3D (Multi-View 3D object detection network for autonomous driving) generated proposals from the LiDAR point cloud alone. But such a region proposal network does not work for smaller object instances in the bird's-eye-view (BEV) image.
Why does this not work?
Because the input is downsampled by the convolutional feature extractor, smaller object instances occupy only a fraction of a pixel in the final feature map. A quick arithmetic check follows.
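Assuming a pedestrian roughly 0.5 m wide, the 0.1 m BEV resolution and the factor-of-8 downsampling (both described in the Architecture section below) give:

```python
bev_resolution_m = 0.1      # metres per BEV pixel (see section (a) below)
pedestrian_width_m = 0.5    # rough pedestrian width, an assumed figure
downsample_factor = 8       # encoder stride (see section (b) below)

width_px = pedestrian_width_m / bev_resolution_m   # 5 pixels in the input
width_after = width_px / downsample_factor         # 0.625 pixels
print(width_px, width_after)  # 5.0 0.625 -> less than one pixel survives
```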
How does AVOD overcome this limitation?
The architecture fuses full-resolution feature crops from both the image and the BEV of the point cloud as input to the Region Proposal Network, and generates region proposals independently of object orientation.
Architecture :
(a) Generating feature maps from the point clouds and the images.
- A 6-channel BEV representation is generated from a voxel grid of the point cloud at a resolution of 0.1 m.
- The point cloud is cropped to [-40, 40] m laterally and [0, 70] m in the forward direction.
- The first 5 channels of the BEV encode the maximum height of the points in each grid cell, taken over 5 equal slices of [0, 2.5] m along the Z-axis.
- The last channel, channel 6, contains the point density information (a sketch of the full encoding follows).
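Below is a minimal NumPy sketch of this 6-channel encoding, assuming the crop ranges and resolution above; the density normalization min(1, log(N+1)/log 16) follows the convention used in AVOD-style BEV encodings.

```python
import numpy as np

def make_bev(points, res=0.1, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
             z_range=(0.0, 2.5), n_slices=5):
    """Encode an (N, 3) LiDAR point cloud as a 6-channel BEV map.

    Channels 0-4: max height of the points in each cell, one channel per
    vertical slice of z_range. Channel 5: normalized point density.
    """
    h = int((x_range[1] - x_range[0]) / res)   # 700 cells forward
    w = int((y_range[1] - y_range[0]) / res)   # 800 cells lateral
    bev = np.zeros((h, w, n_slices + 1), dtype=np.float32)

    # Keep only points inside the crop region.
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    mask = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z = x[mask], y[mask], z[mask]

    row = ((x - x_range[0]) / res).astype(int)
    col = ((y - y_range[0]) / res).astype(int)
    slice_h = (z_range[1] - z_range[0]) / n_slices
    sl = np.minimum(((z - z_range[0]) / slice_h).astype(int), n_slices - 1)

    # Max height per cell per slice (simple loop; real code would vectorize).
    for r, c, s, zz in zip(row, col, sl, z):
        bev[r, c, s] = max(bev[r, c, s], zz)

    # Density channel: a common normalization, min(1, log(N+1)/log(16)).
    counts = np.zeros((h, w), dtype=np.float32)
    np.add.at(counts, (row, col), 1.0)
    bev[:, :, n_slices] = np.minimum(1.0, np.log(counts + 1) / np.log(16))
    return bev
```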
(b) The Feature Extractor :
- The feature extractor consists of two segments: an encoder and a decoder.
- Given an input image of shape M × N × D, the encoder produces a feature map of shape M/8 × N/8 × D.
- Downsampling by 8 results in small classes occupying less than one pixel of the feature map.
- Inspired by FPN (Feature Pyramid Network), a bottom-up decoder learns to upsample the feature map back to the same size as the input.
- The final feature map is high-resolution with strong representational power, and it is shared by the RPN (Region Proposal Network) and the second-stage network. A toy sketch of the encoder-decoder follows.
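Here is a toy PyTorch sketch of the encoder-decoder idea, not the paper's exact VGG-based extractor: the encoder downsamples by 8, and the decoder upsamples back to the input size while merging skip connections from the matching encoder stages, FPN-style. All channel sizes are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyExtractor(nn.Module):
    """Illustrative encoder-decoder feature extractor."""
    def __init__(self, in_ch=6, ch=32):
        super().__init__()
        # Encoder: three stride-2 stages -> overall downsampling by 8.
        self.enc1 = nn.Conv2d(in_ch, ch, 3, stride=2, padding=1)
        self.enc2 = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)
        self.enc3 = nn.Conv2d(ch * 2, ch * 4, 3, stride=2, padding=1)
        # Decoder: upsample and merge with the matching encoder output.
        self.up3 = nn.ConvTranspose2d(ch * 4, ch * 2, 2, stride=2)
        self.up2 = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)
        self.up1 = nn.ConvTranspose2d(ch, ch, 2, stride=2)
        self.fuse2 = nn.Conv2d(ch * 4, ch * 2, 3, padding=1)
        self.fuse1 = nn.Conv2d(ch * 2, ch, 3, padding=1)

    def forward(self, x):
        e1 = F.relu(self.enc1(x))            # M/2 x N/2
        e2 = F.relu(self.enc2(e1))           # M/4 x N/4
        e3 = F.relu(self.enc3(e2))           # M/8 x N/8 (encoder output)
        d2 = self.fuse2(torch.cat([self.up3(e3), e2], dim=1))  # back to M/4
        d1 = self.fuse1(torch.cat([self.up2(d2), e1], dim=1))  # back to M/2
        return self.up1(d1)                  # full-resolution feature map

feats = ToyExtractor()(torch.randn(1, 6, 704, 800))
print(feats.shape)  # torch.Size([1, 32, 704, 800])
```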
(c) Multi-modal Fusion Region Proposal Network.
- For the region proposals we use prior boxes, placed where the probability of object presence is high, known as ANCHORS.
- Each anchor is parameterized by a centroid, which acts as its midpoint, denoted tₓ, tᵧ, t𝓏, and by axis-aligned dimensions dₓ, dᵧ, d𝓏.
- How are these anchors generated? We generate a grid of anchors with (tₓ, tᵧ) pairs sampled at a 0.5 m resolution in BEV; that is, for every 0.5 m in the real world we place a set of anchors to detect objects.
- tₓ and tᵧ come from this sampling grid, while t𝓏 is determined by the height above the ground plane.
- Similar to other detectors, the anchor dimensions dₓ, dᵧ, d𝓏 are obtained by k-means clustering of the training samples (see the sketch after this list).
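A sketch of the anchor grid under stated assumptions: a flat ground plane at z = 0 and a single illustrative car-sized prior (in practice the per-class priors come from the clustering step above, and implementations typically also add a rotated variant per location).

```python
import numpy as np

def make_anchor_grid(x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                     stride=0.5, prior_dims=(3.9, 1.6, 1.5), z_ground=0.0):
    """Place one (tx, ty, tz, dx, dy, dz) anchor every `stride` metres in BEV.

    prior_dims is an illustrative car-sized prior, not a value from the
    paper; AVOD obtains these per class by clustering the training labels.
    """
    tx = np.arange(x_range[0], x_range[1], stride)   # 140 forward positions
    ty = np.arange(y_range[0], y_range[1], stride)   # 160 lateral positions
    gx, gy = np.meshgrid(tx, ty, indexing="ij")
    dx, dy, dz = prior_dims
    tz = z_ground + dz / 2.0   # centroid sits half a box above the ground
    n = gx.size
    anchors = np.stack([gx.ravel(), gy.ravel(), np.full(n, tz),
                        np.full(n, dx), np.full(n, dy), np.full(n, dz)],
                       axis=1)
    return anchors

print(make_anchor_grid().shape)  # (22400, 6) -> 140 * 160 locations
```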
How do we get the feature crops from the feature maps, resize them into fixed-length vectors, and pass them to the second-stage detector?
- Each 3D anchor is projected onto the BEV feature map (built from the 6-channel BEV input) and onto the image feature map.
- After extracting feature crops from both views, we resize them bilinearly to 3 × 3 to obtain fixed-length feature vectors.
- Extraction of the feature crops respects the aspect ratio of the projected anchor in both views, which is more reliable than the fixed-size RoI pooling of Faster R-CNN. A sketch of this step closes the section.
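Finally, a sketch of the crop-resize-fuse step, using torchvision's roi_align as a stand-in for bilinear crop-and-resize; the feature maps, channel count, and box coordinates are made up. The RPN stage fuses the two equal-sized crops (the paper uses an element-wise mean) into one fixed-length vector per anchor.

```python
import torch
import torchvision.ops as ops

# Made-up full-resolution feature maps after the 1x1 bottleneck (32 channels).
bev_feats = torch.randn(1, 32, 700, 800)
img_feats = torch.randn(1, 32, 360, 1200)

# Each 3D anchor is projected to an axis-aligned 2D box in each view.
# Boxes are (batch_index, x1, y1, x2, y2) in that view's pixel coordinates;
# these example values are arbitrary.
bev_boxes = torch.tensor([[0., 390., 340., 430., 356.]])
img_boxes = torch.tensor([[0., 600., 180., 680., 240.]])

# Bilinear crop-and-resize of both views to a fixed 3x3 spatial size,
# preserving each projected anchor's aspect ratio within the crop.
bev_crop = ops.roi_align(bev_feats, bev_boxes, output_size=(3, 3))
img_crop = ops.roi_align(img_feats, img_boxes, output_size=(3, 3))

# RPN-stage fusion: element-wise mean of the equally sized crops,
# flattened into one fixed-length vector per anchor.
fused = (bev_crop + img_crop) / 2.0
vector = fused.flatten(start_dim=1)
print(vector.shape)  # torch.Size([1, 288]) -> 32 channels * 3 * 3
```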