Camera Geometry and the Pinhole Model for Camera Calibration
Understanding how the pinhole camera works
The pinhole camera model explains the relationship between a point in the world and the projection on the image plane (image sensor).
To use it, we need to know how to project points in the real world onto the camera sensor.
THE PINHOLE MODEL:
If we expose a wide-open imaging sensor directly to a scene, we end up with blurry images, because each location on the sensor collects light rays from multiple points on the object.
The solution to this problem is to put a barrier with a tiny hole in front of the imaging sensor.
The barrier allows only a narrow bundle of light rays to pass through the hole, which reduces the blurriness of the image.
Usually, the pinhole camera parameters are represented in a 3 × 4 matrix called the camera matrix.
Focal length: the distance between the pinhole and the image plane.
The focal length and the camera center are the camera intrinsic parameters, K. (Using K for the intrinsic matrix is an industry norm.)
Intrinsic parameters
(Also known as the camera matrix, K.)
(cx, cy) : camera center in pixels.
(fx, fy) : focal length in pixels.
fx = F/px
fy = F/py
F : focal length in world units (e.g. millimeters).
Pixel Skew
(px, py) : size of a pixel in world units.
s : skew coefficient, which is non-zero if the image axes are not perpendicular.
s = fx tan(α), where α is the skew angle between the image axes.
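Putting these parameters together, K is commonly written as a 3 × 3 matrix with fx and fy on the diagonal, the skew s next to fx, and the camera center (cx, cy) in the last column. Here is a minimal numpy sketch that assembles K; the focal length, pixel size, center, and skew below are made-up values for illustration only:

```python
import numpy as np

# Hypothetical camera specs (illustrative values, not a real camera)
F = 4.0                 # focal length in world units (mm)
px, py = 0.002, 0.002   # pixel size in world units (mm) -> square pixels
cx, cy = 960.0, 540.0   # camera center in pixels
s = 0.0                 # skew coefficient (0 if image axes are perpendicular)

fx, fy = F / px, F / py  # focal length in pixels

# Intrinsic matrix K, shape (3, 3)
K = np.array([
    [fx,  s, cx],
    [ 0, fy, cy],
    [ 0,  0,  1],
])
print(K)
```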
# World Coordinate System
Oworld = [Xw, Yw, Zw]

# Camera Coordinate System
Ocamera = [Xc, Yc, Zc]

# Pixel Coordinate System
Oimage = [u, v]
Calculation:
Now we can frame the projection problem (World coordinates → Image coordinates) as two steps:
- World coordinates → Camera coordinates
- Camera coordinates → Image coordinates
Oworld [Xw,Yw,Zw] → Oimage [u,v]
How? By using linear algebra!
1. World coordinates → Camera coordinates
Ocamera = [R|t] * Oworld
2. Camera coordinates → Image coordinates
Oimage = K * Ocamera
Remind me what K (the camera intrinsic parameters) was?
The intrinsic parameters K: f for the focal length, c for the camera center; these are camera-specific parameters.
Both steps 1 and 2 are just matrix multiplications. Therefore, they can be combined and re-written as:
Oimage = P * Oworld = K[R|t] * Oworld
Let P = K[R|t]. (P for Projection.)
Wait, K is a (3, 3) matrix and [R|t] is (3, 4) ("|" means you are concatenating the matrix R with the vector t), so K[R|t] is (3, 4). But Oworld = [Xw, Yw, Zw] is (3, 1).
Then you can't multiply K[R|t], which is (3, 4), with Oworld, which is (3, 1)!
😎 We can resolve this by appending a 1 to the end of the Oworld vector, [Xw, Yw, Zw, 1], called homogeneous coordinates (or projective coordinates).
If you want to further transform image coordinates into pixel coordinates, divide by z (the perspective divide):
[x, y, z] → [u, v, 1] = 1/z * [x, y, z]
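Putting both steps together, here is a minimal end-to-end numpy sketch of the projection. The intrinsics, rotation, and translation below are made-up placeholder values, chosen only to show the shapes and the order of operations:

```python
import numpy as np

# --- Hypothetical parameters (placeholder values for illustration) ---
K = np.array([[2000.0,    0.0, 960.0],   # intrinsics, shape (3, 3)
              [   0.0, 2000.0, 540.0],
              [   0.0,    0.0,   1.0]])
R = np.eye(3)                             # rotation: world -> camera, (3, 3)
t = np.array([[0.0], [0.0], [10.0]])      # translation, (3, 1)

# Extrinsics [R|t]: concatenate R with t -> shape (3, 4)
Rt = np.hstack([R, t])

# Projection matrix P = K[R|t] -> shape (3, 4)
P = K @ Rt

# A world point in homogeneous coordinates [Xw, Yw, Zw, 1] -> shape (4, 1)
Oworld = np.array([[1.0], [2.0], [5.0], [1.0]])

# Project: (3, 4) @ (4, 1) -> (3, 1), then divide by z (perspective divide)
x, y, z = (P @ Oworld).flatten()
u, v = x / z, y / z
print(u, v)  # pixel coordinates
```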
This is it. This is the core. This simple projection principle is used in every 3D visual perception algorithm, from object detection to 3D scene reconstruction.
In real life, there will be more complex scenarios, e.g. non-square pixels, camera axis skew, distortion, a non-unit aspect ratio, etc. However, they only change the camera matrix K, and the equations stay the same.
A few things to note:
a) The rotation matrix (R) and the translation vector (t) are called extrinsic parameters because they are “external” to the camera.
The translation vector t can be interpreted as the position of the world origin in camera coordinates, and the columns of the rotation matrix R represent the directions of the world-axes in camera coordinates. This can be a little confusing to interpret because we are used to thinking in terms of the world coordinates.
b) Usually, multiple sensors (e.g. camera, lidar, radar, etc.) are used for perception in self-driving vehicles. Each sensor comes with its own extrinsic parameters that define the transform from the sensor frame to the vehicle frame.
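As a rough illustration, moving a point from a sensor frame to the vehicle frame is just another rigid-body transform. A minimal sketch using the common 4 × 4 homogeneous-transform convention; the extrinsics below are made-up values, not from any real vehicle setup:

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from rotation R (3, 3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Hypothetical extrinsics: camera frame -> vehicle frame (placeholder values)
T_vehicle_from_camera = make_transform(np.eye(3), np.array([1.5, 0.0, 1.2]))

# A point in the camera frame, in homogeneous coordinates
p_camera = np.array([2.0, 0.0, 10.0, 1.0])

# The same point expressed in the vehicle frame
p_vehicle = T_vehicle_from_camera @ p_camera
print(p_vehicle[:3])
```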
c) The image coordinates [u, v] start from the top-left corner of the virtual image plane. That's why we adjust the pixel locations to the image coordinate frame using the camera center (cx, cy).
Seeing is believing
Take a look at the function project_ego_to_image. It calls two functions in a row, project_ego_to_cam first, then project_cam_to_image, just as we converted the world coordinates into the image coordinates by breaking the projection down into two steps: World coordinates → Camera coordinates, then Camera coordinates → Image coordinates. cart2hom converts Cartesian coordinates into homogeneous coordinates.
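For intuition, here is a minimal numpy sketch of how such a two-step pipeline could look. The function names mirror the ones above, but the signatures, frames, and conventions are assumptions for illustration, not the actual library code:

```python
import numpy as np

def cart2hom(pts):
    """Append a 1 to each Cartesian point: (N, 3) -> (N, 4)."""
    return np.hstack([pts, np.ones((pts.shape[0], 1))])

def project_ego_to_cam(pts_ego, Rt):
    """Ego/world -> camera: apply the (3, 4) extrinsics to homogeneous points."""
    return cart2hom(pts_ego) @ Rt.T          # (N, 3) camera coordinates

def project_cam_to_image(pts_cam, K):
    """Camera -> image: apply the intrinsics K, then the perspective divide."""
    uvw = pts_cam @ K.T                      # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]          # (N, 2) pixel coordinates

def project_ego_to_image(pts_ego, Rt, K):
    """The two steps composed, just like in the derivation above."""
    return project_cam_to_image(project_ego_to_cam(pts_ego, Rt), K)

# Example usage with made-up extrinsics and intrinsics
Rt = np.hstack([np.eye(3), np.array([[0.0], [0.0], [10.0]])])
K = np.array([[2000.0, 0.0, 960.0],
              [0.0, 2000.0, 540.0],
              [0.0, 0.0, 1.0]])
print(project_ego_to_image(np.array([[1.0, 2.0, 5.0]]), Rt, K))
```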