Contents


Feature Detection and Matching

Introduction

In computer vision we often work with images indirectly: we find features and operate on them instead of on raw pixel values. Features can be anything, but interest points have historically been the most useful, especially for single- and multiple-view geometry from images. Interest points can be any kind of junction, including T-junctions, crossings, or corners, but are often grouped together under the name corners. They are usually found by a filtering operation, so the measure of corner-ness is referred to as the corner response.
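As a concrete illustration, a Harris-style corner response can be computed in a few lines (a sketch in plain NumPy; the window radius and the constant k are illustrative values, not tied to any particular implementation):

```python
import numpy as np

def harris_response(img, k=0.05, r=1):
    """Harris corner response R = det(M) - k * trace(M)^2, where M is the
    structure tensor summed over a (2r+1)x(2r+1) window around each pixel."""
    iy, ix = np.gradient(img.astype(float))

    def window_sum(a):
        # Sum over the window by shifting and adding (wraps at the borders,
        # which is fine for this interior-only demo).
        out = np.zeros_like(a)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out += np.roll(np.roll(a, dy, axis=0), dx, axis=1)
        return out

    ixx = window_sum(ix * ix)
    iyy = window_sum(iy * iy)
    ixy = window_sum(ix * iy)
    det = ixx * iyy - ixy * ixy
    trace = ixx + iyy
    return det - k * trace * trace

# A white square on black: corners respond strongly, edges and flat areas do not.
img = np.zeros((15, 15))
img[4:11, 4:11] = 1.0
R = harris_response(img)
```

The response is large only where the windowed structure tensor has two large eigenvalues, i.e. where the image gradient varies in two directions, which is what distinguishes a corner from an edge or a flat region.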

When finding features, the typical method is to detect interest points, select the best ones (keypoints), and then compute a descriptor for each. A descriptor is an encoding of the region around the keypoint. Comparing descriptors from keypoints detected in multiple images should reveal the true matching points between them. There are multiple methods for interest point detection and multiple descriptor types that can be computed. Some feature types are described below, very compactly; see the references at the end for full details.

SIFT (Scale Invariant Feature Transform) uses Difference of Gaussian filtering at multiple scales to find scale-space extrema; keypoint locations are refined by fitting to the local samples. Low-contrast points and unstable points along edges are rejected. A descriptor is computed from histograms of local gradient orientations. SIFT was patented, but the patent expired in March 2020.

SURF (Speeded Up Robust Features) uses integral images to speed up Gaussian filtering and a Hessian matrix approximation. Features are computed at multiple scales by changing the size of the filter rather than downsampling the image. A descriptor is computed using Haar wavelet responses. SURF is patented, with the patent expiring in May 2027.

ORB (Oriented FAST and Rotated BRIEF) uses the FAST feature detector, which examines a circle of 16 pixels around a candidate centre and requires a contiguous arc of at least 9 of them to be consistently brighter or darker than the centre. Features are detected on a multiscale pyramid of the image. BRIEF descriptors are constructed from binary intensity comparisons between pairs of pixels in the patch around a feature.
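The binary comparisons behind BRIEF can be sketched directly (a toy version with a hypothetical random sampling pattern; the real descriptor uses 256 pairs drawn from a designed pattern over a smoothed patch):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sampling pattern: 128 random point pairs inside a 31x31 patch,
# each row holding (y1, x1, y2, x2).
pairs = rng.integers(0, 31, size=(128, 4))

def brief_descriptor(patch, pairs):
    """Each bit is 1 if the first sampled pixel is darker than the second."""
    return np.array([patch[y1, x1] < patch[y2, x2]
                     for y1, x1, y2, x2 in pairs], dtype=np.uint8)

def hamming(d1, d2):
    """Descriptor distance = number of differing bits."""
    return int(np.count_nonzero(d1 != d2))

patch = rng.random((31, 31))
d1 = brief_descriptor(patch, pairs)
d2 = brief_descriptor(patch + 0.5, pairs)  # brightness shift: bits unchanged
```

Because each bit depends only on the ordering of two intensities, the descriptor is unchanged by any monotone brightness change, and descriptors are compared with the Hamming distance, which is very fast on binary strings.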

SuperPoint is a deep method for finding feature points. It is pretrained on a synthetic dataset of simple shapes with distinct corner points, then trained on images warped by homographies to simulate different views of the same features. LightGlue is a transformer-based method for matching points in which self- and cross-attention add context to the positional encoding to drive a confidence classifier. Points that are confidently not matches are pruned before the next layer.

Pros

Cons

Implementation Notes

  1. Convert image to luminance/grey
  2. Detect interest points
  3. Sort interest points by response to choose best keypoints
  4. Determine orientation of keypoints
  5. Compute descriptors for keypoints using their orientation

  Code on github

Example

Detect and match features between 2 images:

detect-match -c current_frame.png -p previous_frame.png -k keypoints_image.png -m matches_image.png -n 2000 -a sift

The algorithm parameter -a can be set to sift, surf or orb. Note: SURF is only available if you have compiled OpenCV with the non-free components enabled.

The workfeatures.py script will extract frames from a video and run detect-match on consecutive frames.

To use SuperPoint and LightGlue, you can use my techdemo docker image, but a few additional set-up steps are required.

The splg.py script can be used to detect and match features in all consecutive frames in a directory of images:

splg.py --images_dir extracted_frames --output_dir results

where extracted_frames is a directory of images named in the format frame_%05d.png and results is the directory to which images of the keypoints and matches will be written.

In the following videos, the methods are laid out top left = SIFT, top right = SURF, bottom left = ORB, bottom right = SuperPoint/LightGlue.

Comparison of Detected Keypoints

Points that are detected consistently on the same objects over multiple frames are preferred. Note how SuperPoint finds many features on the surface of the water and in the sky. The border on the SuperPoint image is due to the input being scaled for inference.

Comparison of Detected Matches: Each Quadrant Shows a Pair of Consecutive Frames

Matches that are stable across many frames are better than matches that are detected randomly or have short duration. The LightGlue matches from points on the water surface appear to be correct.

Algorithm   Time (microseconds)   Matches
SIFT        512396                2000
SURF        246509                1994
ORB         42961                 2000
SP & LG     71331 *               1603

* Python implementation; timing not strictly comparable.

The table above shows a comparison of the methods for a typical image pair. Each algorithm was set to retain the best 2000 detected keypoints.

Summary


Feature Tracking

Introduction

Features can be tracked by detecting interest points in every frame of a video and finding matches between each pair of consecutive frames. Matching requires descriptors to be computed and each point's descriptor compared against its candidate matches. Methods that explicitly track features avoid computing descriptors and matching because, at video frame rates, most features move only a small amount between frames. Furthermore, if the global motion is slow, most feature tracks remain within the frame and the cost of detecting new features in every frame is avoided.

The KLT (Kanade–Lucas–Tomasi) tracking method is a classic computer vision method based on interest points obtained from Good Features to Track. Given a start position, an iterative search process is applied to match a patch centred at the start position in the first image to a search position in the second image.

CoTracker is a deep method for tracking using a transformer-based model. Convolutional features are extracted and tracked in small overlapping windows. Cross-track attention plus extra support points make the method more robust to noise and occlusions: features tend to lie either on the same object or on the background, and attention captures this relationship. In CoTracker 3, existing trackers were used as teachers to generate labelled training data, and the model was pre-trained using synthetic data.

Pros

Cons

Implementation Notes

For the KLT method,

  1. Open video
  2. Read a frame and convert to grey/luminance
  3. Detect initial set of features to track
  4. Repeat:
    1. Read a new frame and convert to grey/luminance
    2. Iterative search to match the features from the previous frame to the new one
    3. Update each feature position
    4. Some features will not be trackable (they go off frame, are occluded, etc.), so if the number of features falls below a threshold, detect additional features in the new frame

  Code on github

The CoTracker model is available on Torch Hub; we only have to pass it the frames, read with a Python imageio object.

Example

To run KLT tracking, the demo can be built in the techdemo docker container (simply run make). To track a video,

klt-tracker -i video.mp4 -o tracks.mp4 -n 2000

KLT Tracking Result

CoTracker can be run in the techdemo docker container, but some set up steps are required (see the README). To track a video,

python3 online_demo.py --video_path video.mp4 --grid_size 20

CoTracker Result

Summary


References