Exploration of KITTI Dataset for autonomous driving
A benchmark dataset for 3D object detection recorded in Germany.
The driving scenarios in this dataset are recorded as continuous sequences, and separate splits are available for training and testing.
For the KITTI development kit, see this repository:
For KITTI data plotting and further exploration, see this repository:
KITTI data is provided in two formats:
1. RAW DATA: continuous driving-sequence scenarios.
- This includes data from 4 cameras, where 2 cameras are grayscale (b/w) and 2 are colour.
- There is GPS/IMU data; the oxts files contain the GPS/IMU readings.
- We also have the Velodyne point clouds from the lidar.
- We also have the tracklet information.
- Coordinates are in the lidar coordinate system.
- We have calibration information.
2. RANDOM dataset (used for deep learning):
- Only the left colour camera is considered.
- We have the Velodyne information.
- We have ground truth at frame level.
- We also have the calibration information.
- Coordinates are in the camera coordinate system.
KITTI Sensor Format:
Advantages of doing 3D object detection over 2D object detection:
- It gives the length of the object.
- It also gives the orientation of the object.
- Driving decisions are better with 3D information than with 2D information alone.
Hardware specification
The KITTI vehicle carries several sensors, including LIDAR, grayscale cameras, colour cameras and an IMU. However, we will only focus on:
- Cam 0: Grayscale camera, left camera of a stereo rig. This is the reference camera
- Cam 2: RGB colour camera, left camera of a stereo rig
- Velo: 64 beams Velodyne laser scanner
Coordinate System: Vehicle facing left, left-hand coordinate system (fig 3)
What data do we have?
Refer to data/readme.txt for more details.
Lidar point cloud fileid.bin: 2D array with shape [num_points, 4]. Each point encodes XYZ + reflectance.
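As a minimal sketch (the file path in the usage comment is hypothetical), a scan can be loaded with numpy, since each point is stored as four float32 values:

import numpy as np

def load_velo_points(bin_path):
    """Load a KITTI Velodyne scan as an N x 4 array of (x, y, z, reflectance)."""
    return np.fromfile(bin_path, dtype=np.float32).reshape(-1, 4)

# Hypothetical usage:
# pts = load_velo_points('data/000007.bin')
# xyz, reflectance = pts[:, :3], pts[:, 3]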
Object instance fileid_label.txt: each row annotates one object with 15 columns representing metadata and 3D box properties in camera coordinates:
type | truncation | visibility | observation angle | xmin | ymin | xmax | ymax | height | width | length | tx | ty | tz | roty
Some instance types are marked as 'DontCare', indicating they are not labelled.
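As a rough sketch of reading these columns (the dict field names below are my own naming, not part of the devkit):

def read_labels(label_path):
    """Parse a KITTI label file into a list of per-object dicts."""
    objects = []
    with open(label_path) as f:
        for line in f:
            vals = line.split()
            objects.append({
                'type': vals[0],
                'truncation': float(vals[1]),
                'visibility': int(float(vals[2])),             # occlusion state
                'alpha': float(vals[3]),                       # observation angle
                'bbox2d': [float(v) for v in vals[4:8]],       # xmin, ymin, xmax, ymax
                'dimensions': [float(v) for v in vals[8:11]],  # height, width, length
                'location': [float(v) for v in vals[11:14]],   # tx, ty, tz (camera coords)
                'rot_y': float(vals[14]),                      # roty
            })
    return objects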
RGB image fileid_image.png: image from camera 2.
Calibration parameters fileid_calib.txt
The calibration parameters are stored in row-major order. It contains the 3x4 projection matrix parameters which describe the mapping of 3D points in the world to 2D points in an image.
The calibration process is explained in [2]. An important thing to note is that calibration is done with cam0 as the reference sensor. The laser scanner is registered with respect to the reference camera coordinate system.
Rectification R_ref2rect has also been considered during calibration to correct for planar alignment between cameras.
It contains the following information:
- P_rect[i]: projective transformation from the rectified reference camera frame to cam[i]. Note that bx[i] denotes the baseline with respect to the reference camera 0.
- R0_rect: rotation to account for rectification for points in the reference camera.
- Tr_velo_to_cam: Euclidean transformation from the lidar to the reference camera cam0.
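As a minimal sketch (assuming the usual key: value layout of the object-detection calib files), the file can be read into a dict of row-major numpy arrays:

import numpy as np

def read_calib(calib_path):
    """Read a KITTI calibration file into a dict of flat numpy arrays."""
    calib = {}
    with open(calib_path) as f:
        for line in f:
            if ':' not in line:
                continue
            key, values = line.split(':', 1)
            calib[key.strip()] = np.array([float(v) for v in values.split()])
    return calib

# Typical keys and their shapes after reshaping:
# P2             -> reshape(3, 4)   projection into camera 2
# R0_rect        -> reshape(3, 3)   rectifying rotation for the reference camera
# Tr_velo_to_cam -> reshape(3, 4)   rigid transform from the lidar to cam0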
Projection Between Frames
Recall from linear algebra that a projection matrix, when expressed in homogeneous coordinates, is simply a linear transformation that warps points from one vector space to another through multiplication, x' = Px. Such transformations can be composed to traverse the different coordinate systems. I refer you to this amazing video for more explanation.
Transformation matrices in this context mainly represent the rigid body transformations between sensors and the perspective projection (collapsing the z component of the column vector) from 3D to 2D points. These matrices are given in the calibration files.
Projection from lidar to camera 2, project_velo_to_cam2: suppose we would like to convert Velodyne points into camera 2 coordinates; the following transformations are composed.
proj_mat = P_rect2cam2 @ R_ref2rect @ P_velo2cam_ref
Note that the multiplication should be performed in homogeneous coordinates to simplify the calculation. To convert to pixel coordinates, simply normalise by the z-coordinate.
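A minimal sketch of this composition, assuming the calib dict returned by the reader above (the helper names are mine):

import numpy as np

def project_velo_to_cam2(calib):
    """Compose lidar -> rectified reference camera -> camera 2 (3 x 4)."""
    # Rigid transform from the lidar to the reference camera cam0
    P_velo2cam_ref = np.vstack([calib['Tr_velo_to_cam'].reshape(3, 4),
                                [0., 0., 0., 1.]])               # 4 x 4
    # Rectifying rotation, expanded to homogeneous form
    R_ref2rect = np.eye(4)
    R_ref2rect[:3, :3] = calib['R0_rect'].reshape(3, 3)          # 4 x 4
    # Projection from the rectified cam0 frame into camera 2 pixels
    P_rect2cam2 = calib['P2'].reshape(3, 4)                      # 3 x 4
    return P_rect2cam2 @ R_ref2rect @ P_velo2cam_ref             # 3 x 4

def project_to_image(points_xyz, proj_mat):
    """Project N x 3 lidar points to N x 2 pixel coordinates."""
    pts_hom = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # N x 4
    pts_2d = (proj_mat @ pts_hom.T).T                                     # N x 3
    return pts_2d[:, :2] / pts_2d[:, 2:3]    # normalise by the z-coordinate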
Fig 4. Transformation steps
Projection from camera to lidar coordinates: annotations of 3D boxes are given in camera coordinates. If we would like to convert box vertices from the camera frame to the lidar frame, project_cam2_to_velo, we compute the inverse rigid transformation and transform backwards.
R_ref2rect_inv = np.linalg.inv(R_ref2rect)
P_cam_ref2velo = np.linalg.inv(P_velo2cam_ref)
proj_mat = P_cam_ref2velo @ R_ref2rect_inv
Boxes In Image
see render_image_with_boxes
Boxes are commonly used to represent other agents and pedestrians. They are simple to acquire and annotate: defining 8 vertices fully localises the object as a box model. In my opinion, boxes might not be the best way to represent pedestrians, as pedestrians are not rigid.
There are other methods to represent objects, including key points, CAD models and segmentation masks, that are worth considering.
Fig 5. Displaying boxes on the image plane
From the annotation, we are given the location of the box (t), the yaw angle (R) of the box in camera coordinates (it is safe to assume no pitch and roll) and the dimensions: height (h), width (w) and length (l). Note that 3D boxes of objects are annotated in camera coordinates! Given this information, we can easily transform the box model to the exact location in camera space.
Consider fig 5 above: each box instance's origin is set at the base centre, corresponding to the same height as the ego vehicle and ground level. To project 3D boxes to the image:
- First, we obtain the box in camera coordinates via [R|t], where R = roty and t = (tx, ty, tz) from the annotation in label.txt.
- Next, apply the perspective projection P_rect2cam2 to map the box onto the image plane (see the sketch below).
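A minimal sketch of these two steps, assuming the annotation fields named earlier (the corner ordering below is one common convention, not mandated by the devkit):

import numpy as np

def compute_box_corners_cam(h, w, l, tx, ty, tz, ry):
    """Return the 8 corners (3 x 8) of a 3D box in camera coordinates.
    The box origin is at the bottom centre, as annotated in label.txt."""
    # Corners in the object frame (x along length, y down along height, z along width)
    x = [ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2]
    y = [   0,    0,    0,    0,   -h,   -h,   -h,   -h]
    z = [ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2]
    corners = np.vstack([x, y, z])                      # 3 x 8
    # Yaw rotation about the camera Y-axis, then translation to (tx, ty, tz)
    R = np.array([[ np.cos(ry), 0, np.sin(ry)],
                  [          0, 1,          0],
                  [-np.sin(ry), 0, np.cos(ry)]])
    return R @ corners + np.array([[tx], [ty], [tz]])

def project_box_to_image(corners_cam, P_rect2cam2):
    """Project 3 x 8 camera-frame corners to 8 x 2 pixel coordinates."""
    corners_hom = np.vstack([corners_cam, np.ones((1, 8))])    # 4 x 8
    pts = P_rect2cam2 @ corners_hom                            # 3 x 8
    return (pts[:2] / pts[2]).T                                # 8 x 2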
PointCloud in Image [3D-2D]
see render_lidar_on_image
Fig 6. Colour-coded range value of Lidar points on the image
If we would like to process data in 2D, more information can be gathered by projecting the point cloud onto the image to construct a sparse depth map representation using the corresponding lidar range values (z). The sparsity depends on the number of lidar beams that map to pixels.
Sparse depth maps provide convenient and accurate range data compared to predicting a depth map from a camera. A work that uses sparse depth maps to enhance monocular-based detection is described in pseudo-lidar++.
Mapping points to pixels is a projective transformation from the lidar frame to the image plane:
- Compute the projection matrix project_velo_to_cam2.
- Project points to the image plane.
- Remove points that lie outside the image boundaries (see the sketch below).
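A minimal sketch of these steps that also fills a sparse depth map, assuming the helpers above (the image size arguments and the x > 0 front-of-vehicle filter are my own assumptions):

import numpy as np

def render_lidar_depth(points_velo, proj_velo_to_cam2, img_h, img_w):
    """Project lidar points into the image and build a sparse depth map."""
    # Keep only points roughly in front of the vehicle (x points forward in the lidar frame)
    pts = points_velo[points_velo[:, 0] > 0, :3]
    pts_hom = np.hstack([pts, np.ones((pts.shape[0], 1))])       # N x 4
    pts_cam = (proj_velo_to_cam2 @ pts_hom.T).T                  # N x 3
    depth = pts_cam[:, 2]
    uv = pts_cam[:, :2] / depth[:, None]                         # pixel coordinates
    # Remove points that lie outside the image boundaries or behind the camera
    mask = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & \
           (uv[:, 1] >= 0) & (uv[:, 1] < img_h) & (depth > 0)
    uv, depth = uv[mask].astype(int), depth[mask]
    depth_map = np.zeros((img_h, img_w), dtype=np.float32)       # sparse: mostly zeros
    depth_map[uv[:, 1], uv[:, 0]] = depth
    return depth_map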
Boxes in PointCloud [2D-3D]
see render_lidar_with_boxes
Visualising and working in Lidar space provides the most comprehensive understanding in terms of spatial reasoning. Also, we can easily change our camera viewpoint to look at the environment from a different perspective if required.
Fig 7. 3D boxes projected onto point clouds
In this example, instead of drawing all the scanned points from the 360-degree rotating LIDAR scanner, we only consider point clouds that lie within the field of view of the camera, as shown in fig 4. The steps taken are similar to the point-cloud-in-image example. Next, we apply the inverse transformation to project the 3D boxes from camera coordinates to LIDAR using the projection project_cam2_to_velo.
The steps are as follows:
- Compute the projection matrix project_velo_to_cam2.
- Project points to the image plane.
- Remove points that lie outside the image boundaries.
- Project the 3D boxes to LIDAR coordinates (see the sketch below).
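A minimal sketch of the last step, assuming the calib reader and box-corner helper above (the function names are mine):

import numpy as np

def project_cam2_to_velo(calib):
    """Inverse transform: rectified camera frame -> lidar frame (4 x 4)."""
    R_ref2rect = np.eye(4)
    R_ref2rect[:3, :3] = calib['R0_rect'].reshape(3, 3)
    P_velo2cam_ref = np.vstack([calib['Tr_velo_to_cam'].reshape(3, 4),
                                [0., 0., 0., 1.]])
    # Invert the rectification and the rigid lidar-to-cam0 transform
    return np.linalg.inv(P_velo2cam_ref) @ np.linalg.inv(R_ref2rect)

def box_corners_cam_to_velo(corners_cam, proj_cam2_to_velo_mat):
    """Transform 3 x 8 box corners from camera to lidar coordinates."""
    corners_hom = np.vstack([corners_cam, np.ones((1, corners_cam.shape[1]))])
    return (proj_cam2_to_velo_mat @ corners_hom)[:3]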
Data Format Description of KITTI Sequential Data
The data for training and testing can be found in the corresponding folders.
The sub-folders are structured as follows:
1. image_02/ contains the left color camera images (png)
2. label_02/ contains the left color camera label files (plain text files)
3. calib/ contains the calibration for all four cameras (plain text file)
The label files contain the following information, which can be read and written using the MATLAB tools (readLabels.m, writeLabels.m) provided with this devkit. All values (numerical or strings) are separated by spaces, and each row corresponds to one object. The 15 columns represent:
1. type (1 value): describes the type of object: 'Car', 'Van', 'Truck', 'Pedestrian', 'Person_sitting', 'Cyclist', 'Tram', 'Misc' or 'DontCare'.
2. truncated (1 value): float from 0 (non-truncated) to 1 (truncated), where truncated refers to the object leaving the image boundaries.
3. occluded (1 value): integer (0,1,2,3) indicating occlusion state: 0 = fully visible, 1 = partly occluded, 2 = largely occluded, 3 = unknown.
4. alpha (1 value): observation angle of the object, ranging [-pi..pi].
5. bbox (4 values): 2D bounding box of the object in the image (0-based index): contains left, top, right, bottom pixel coordinates.
6. dimensions (3 values): 3D object dimensions: height, width, length (in meters).
7. location (3 values): 3D object location x, y, z in camera coordinates (in meters).
8. rotation_y (1 value): rotation ry around the Y-axis in camera coordinates [-pi..pi].
9. score (1 value): only for results: float indicating confidence in the detection, needed for precision/recall curves; higher is better.
Additional Points :
- Calibration is done with camera 0 as the reference sensor.
- Rectification is done during calibration to correct for planar alignment between the cameras.
- Tr_velo_to_cam: Euclidean transformation from the lidar to the reference camera.
- R0_rect: rotation to account for rectification of points.
- P_rect -> projective transformation with respect to the reference camera 0.
- To convert Velodyne points to the camera coordinate system, the transformations to be composed are: proj_mat = P_rect2cam2 @ R_ref2rect @ P_velo2cam_ref.
- The multiplication should be performed in homogeneous coordinates to simplify the calculation. To convert to pixel coordinates, simply normalise by the z-coordinate.
- This step is not needed if the data is already in the camera coordinate system.
- Projection from camera to lidar coordinates: annotations of 3D boxes are given in the camera coordinate system. If we would like to convert them into the lidar frame, then:
R_ref2rect_inv = np.linalg.inv(R_ref2rect)
P_cam_ref2velo = np.linalg.inv(P_velo2cam_ref)
proj_mat = P_cam_ref2velo @ R_ref2rect_inv
- (cx, cy, cz) are the coordinates of the object bounding box in meters in the LiDAR coordinate frame.
- These can be transformed to the world coordinate frame if required; otherwise, (cx, cy, cz) itself represents the position (in meters) of the object bounding box relative to the LiDAR sensor.