From 2D images we can extract only a limited range of information, such as width, height, and color. This can be enough to locate the regions of interest in our images: street signs, lanes, or even roads.
However, for more accurate detection, depth perception is crucial. This is where 3D reconstruction comes into play. By extracting a third dimension, the depth, we can determine how far the regions of interest are from the camera and, consequently, their shape. This way we can distinguish the road from the obstacles (cars, pedestrians, curbstones): the distance from the camera to the road surface keeps increasing along the road, while the points on an obstacle all lie at roughly the same distance (fig. 1).
The most effective way of constructing 3D images is to use a stereo camera. A stereo camera has 2 or more lenses, which allow it to simulate human binocular vision.
Because we work with mobile phone mono cameras, we simulate stereo vision by using multiple photos of the same object taken from different positions. The camera position, known from the GPS information, must have millimeter precision. The process of estimating the 3D structure of a scene from a set of 2D images is called Structure from Motion, or SFM (fig. 2).
Between 2 consecutive photos we find corresponding feature points, i.e. pixels that show the same scene point in both photos.
The main steps in SFM are:
- Calibrate the camera (determine intrinsics, extrinsics, and distortion coefficients)
- Detect the matching features between the 2 images
- Perform the triangulation (3D reconstruction)
Step 1. Camera Calibration
First of all, we must pay attention to the coordinate systems we work with (fig. 3).
The transformation from one coordinate system to another can be described by a series of matrix multiplications. The conversion from world coordinates to pixel coordinates is called Forward Projection. The opposite conversion – which we compute in this algorithm – is called Backward Projection (from pixel coordinates to world coordinates) (fig. 4).
Our goal is to describe this sequence of transformations by a big matrix equation! (fig. 5)
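For reference, the usual textbook form of this equation for a pinhole camera is sketched below. The symbols are assumptions and may differ from the notation used in fig. 5: (X, Y, Z) is a point in world coordinates, (u, v) is its pixel, s is a scale factor equal to the depth of the point in the camera frame, K is the intrinsic matrix, and [R | t] is the extrinsic matrix.

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\underbrace{\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{K}
\;
\underbrace{\begin{bmatrix} R & t \end{bmatrix}}_{3 \times 4}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
```

Read left to right this is the Forward Projection. Going backward from a single pixel only fixes a ray, because s is unknown, which is why at least two views of the same point are needed.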
The intrinsic matrix is determined using a calibration algorithm (e.g. the chessboard pattern calibration from OpenCV). It consists of the intrinsic parameters: focal length, center of the image, and aspect ratio of a pixel (fig. 6).
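As a minimal sketch of how these parameters are arranged, the snippet below builds an intrinsic matrix with NumPy and uses it to project a point that is already expressed in camera coordinates. The function names and all numeric values are hypothetical; real values come out of the calibration. The pixel aspect ratio is represented here by two separate focal lengths, fx and fy, which is an assumption about the parametrization.

```python
import numpy as np

def intrinsic_matrix(fx, fy, cx, cy, skew=0.0):
    """Build a 3x3 pinhole intrinsic matrix.

    fx, fy: focal lengths in pixels (fy / fx encodes the pixel aspect ratio)
    cx, cy: image center (principal point) in pixels
    """
    return np.array([
        [fx, skew, cx],
        [0.0, fy, cy],
        [0.0, 0.0, 1.0],
    ])

def project_camera_point(K, point_cam):
    """Project a 3D point given in camera coordinates to pixel coordinates."""
    u, v, w = K @ np.asarray(point_cam, dtype=float)
    return np.array([u / w, v / w])

# Hypothetical intrinsics for a 1920x1080 image
K = intrinsic_matrix(fx=1500.0, fy=1500.0, cx=960.0, cy=540.0)

# A point 10 m in front of the camera, 0.5 m along x and -0.2 m along y
pixel = project_camera_point(K, [0.5, -0.2, 10.0])
print(pixel)  # [1035.  510.]
```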
The distortion coefficients are used to correct the images and, with them, the positions of the points in the 3D point cloud. Images taken with a camera usually show lens distortions such as barrel or fisheye distortion.
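To illustrate what such coefficients do, here is a small sketch of a two-coefficient radial distortion model applied to normalized image coordinates, together with a fixed-point iteration that undoes it. The choice of a purely radial model with coefficients k1 and k2 is an assumption (calibration tools usually estimate more terms), and the function names and values are hypothetical.

```python
def distort_normalized(x, y, k1, k2):
    """Apply radial distortion to normalized image coordinates (x, y)."""
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * factor, y * factor

def undistort_normalized(xd, yd, k1, k2, iterations=20):
    """Invert the radial model by fixed-point iteration.

    Starts from the distorted point and repeatedly divides by the
    distortion factor evaluated at the current estimate. This converges
    for moderate distortion; it is not a general-purpose solver.
    """
    x, y = xd, yd
    for _ in range(iterations):
        r2 = x * x + y * y
        factor = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = xd / factor, yd / factor
    return x, y

# Hypothetical barrel distortion (negative k1 pulls points toward the center)
k1, k2 = -0.2, 0.05
xd, yd = distort_normalized(0.3, 0.2, k1, k2)
x, y = undistort_normalized(xd, yd, k1, k2)
```

With these values the distorted point lies closer to the image center than the original, and the iteration recovers the original coordinates.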
The extrinsic matrix is composed of a rotation matrix and a translation matrix. The rotation matrix is itself the product of three rotation matrices: roll, pitch, and yaw (fig. 7). The rotations of camera 2 are computed relative to camera 1. The translation matrix (fig. 8) is computed by subtracting the position matrix of camera 1 from the position matrix of camera 2.
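The composition described above can be sketched as follows. The axis assignment (roll about x, pitch about y, yaw about z) and the multiplication order Rz · Ry · Rx are assumptions; the convention in fig. 7 may differ, and the order matters because rotations do not commute. Function names and the sample angles and positions are hypothetical.

```python
import numpy as np

def rotation_from_rpy(roll, pitch, yaw):
    """Compose a rotation matrix as Rz(yaw) @ Ry(pitch) @ Rx(roll).

    Angles are in radians. Axis assignment and order are assumptions.
    """
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    return rz @ ry @ rx

def relative_translation(position_1, position_2):
    """Position of camera 2 minus position of camera 1."""
    return np.asarray(position_2, dtype=float) - np.asarray(position_1, dtype=float)

# Hypothetical orientation change and positions (meters) between two photos
R = rotation_from_rpy(np.radians(1.0), np.radians(-2.0), np.radians(5.0))
t = relative_translation([10.0, 4.0, 1.5], [12.5, 4.2, 1.5])
```

A quick sanity check on any rotation matrix built this way is that it is orthonormal with determinant 1.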
Having the extrinsic and intrinsic matrices, we can compute the projection matrix as their product.
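A minimal sketch of that computation is shown below, under one common convention that may not match the figures exactly: R maps world axes to camera axes, C is the camera center in world coordinates, and the translation column of the extrinsic matrix is t = -R C. The names `projection_matrix` and `project`, and all values, are hypothetical.

```python
import numpy as np

def projection_matrix(K, R, C):
    """Return the 3x4 matrix P = K [R | t] with t = -R C.

    K: 3x3 intrinsic matrix
    R: 3x3 rotation from world axes to camera axes (assumed convention)
    C: camera center in world coordinates
    """
    t = -R @ np.asarray(C, dtype=float)
    return K @ np.hstack([R, t.reshape(3, 1)])

def project(P, X):
    """Forward-project a 3D world point X to pixel coordinates."""
    x = P @ np.append(np.asarray(X, dtype=float), 1.0)
    return x[:2] / x[2]

# Hypothetical setup: camera 1 at the origin, camera 2 moved 2 m along x,
# both with the same orientation and the same intrinsics.
K = np.array([[1500.0, 0.0, 960.0], [0.0, 1500.0, 540.0], [0.0, 0.0, 1.0]])
P1 = projection_matrix(K, np.eye(3), [0.0, 0.0, 0.0])
P2 = projection_matrix(K, np.eye(3), [2.0, 0.0, 0.0])

X = [1.0, 0.0, 10.0]
print(project(P1, X))  # [1110.  540.]
print(project(P2, X))  # [810. 540.]
```

The same world point lands on different pixels in the two views; that shift is the information triangulation uses to recover depth.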
Step 2. Feature Matching
There are many algorithms for matching features between 2 consecutive photos (SURF, SIFT, ORB). We use the SURF algorithm to find the matching points between 2 consecutive images (taken around 2-3 meters apart) and the RANSAC algorithm to filter out the outliers (gif 1).
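To give an idea of what the matching step does once descriptors have been extracted, here is a sketch of brute-force nearest-neighbour matching with a ratio test. This is a stand-in for illustration only: it is neither SURF itself nor the RANSAC filter, it assumes the descriptors are already available as rows of NumPy arrays, and the function name, the ratio value, and the synthetic data are hypothetical.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b.

    A match (i, j) is kept only if the best distance is clearly smaller
    than the second-best one (ratio test), which discards ambiguous
    matches. desc_b must contain at least two descriptors.
    """
    matches = []
    for i, descriptor in enumerate(desc_a):
        distances = np.linalg.norm(desc_b - descriptor, axis=1)
        order = np.argsort(distances)
        best, second = order[0], order[1]
        if distances[best] < ratio * distances[second]:
            matches.append((i, int(best)))
    return matches

# Synthetic data: 50 random 64-dimensional descriptors for image B,
# and 3 slightly perturbed copies of known rows standing in for image A.
rng = np.random.default_rng(0)
desc_b = rng.normal(size=(50, 64))
desc_a = desc_b[[5, 17, 42]] + 0.01 * rng.normal(size=(3, 64))

matches = match_descriptors(desc_a, desc_b)
print(matches)  # [(0, 5), (1, 17), (2, 42)]
```

Matches that pass such a test can still be geometrically inconsistent, which is why an outlier filter such as RANSAC is applied afterwards.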
Step 3. Triangulation
Having the matched points and the projection matrix of each camera, we can run the triangulation algorithm and plot the 3D points. In the plots above you can see the original photo (fig. 9) and the point cloud from different perspectives, computed using 2 photos (fig. 10) and 5 photos (fig. 11).
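One standard way to triangulate a single matched point from two views is the linear (DLT) method, sketched below with NumPy. Whether the original pipeline uses this exact method is an assumption; the function names, camera setup, and test point are hypothetical and chosen so the result can be checked against a known 3D point.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views.

    P1, P2: 3x4 projection matrices
    x1, x2: matching pixel coordinates (u, v) in image 1 and image 2
    Returns the 3D point in world coordinates.
    """
    # Each view contributes two linear equations in the homogeneous point.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Forward-project a 3D point to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical cameras: identical intrinsics, camera 2 shifted 2 m along x
K = np.array([[1500.0, 0.0, 960.0], [0.0, 1500.0, 540.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-2.0], [0.0], [0.0]])])

X_true = np.array([1.0, 0.5, 10.0])
X_est = triangulate_point(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noise-free pixels the estimate matches the true point up to floating-point error. With real matches, errors in position, orientation, and pixel location all shift the result, and the effect grows as the baseline between the photos shrinks relative to the depth.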
As future work, we need to:
- Gather more precise data (orientation and position)
- Remove lens distortions
- Fit a surface to the point cloud to detect the road profile
- Research other feature matching algorithms