Contents


Feature Detection and Matching

Introduction

In computer vision we often work with images indirectly: we find features and operate on them instead of on raw pixel values. Features can be anything, but interest points have historically been the most useful, especially for single- and multiple-view geometry from images. Interest points can be any kind of junction, including T-junctions, crossings, or corners, but are often grouped together under the name corners. They are usually found by a filtering operation, so the measure of corner-ness is referred to as the corner response.
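As a concrete illustration, a Harris-style corner response can be computed in a few lines (a sketch in plain NumPy; the window radius and the constant k are illustrative values, not tied to any particular implementation):

```python
import numpy as np

def harris_response(img, k=0.05, r=1):
    """Harris corner response R = det(M) - k * trace(M)^2, where M is the
    structure tensor summed over a (2r+1)x(2r+1) window around each pixel."""
    iy, ix = np.gradient(img.astype(float))

    def window_sum(a):
        # Sum over the window by shifting and adding (wraps at the borders,
        # which is fine for this interior-only demo).
        out = np.zeros_like(a)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out += np.roll(np.roll(a, dy, axis=0), dx, axis=1)
        return out

    ixx = window_sum(ix * ix)
    iyy = window_sum(iy * iy)
    ixy = window_sum(ix * iy)
    det = ixx * iyy - ixy * ixy
    trace = ixx + iyy
    return det - k * trace * trace

# A white square on black: corners respond strongly, edges and flat areas do not.
img = np.zeros((15, 15))
img[4:11, 4:11] = 1.0
R = harris_response(img)
```

The response is large only where the windowed structure tensor has two large eigenvalues, i.e. where the image gradient varies in two directions, which is what distinguishes a corner from an edge or a flat region.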

When finding features, the typical method is to detect interest points, select the best ones (keypoints), and then compute a descriptor for each. A descriptor is an encoding of the region around the keypoint. Comparing descriptors from keypoints detected in multiple images should reveal the true matching points between them. There are multiple methods for interest point detection and multiple descriptor types that can be computed. Some feature types are described below, very compactly; see the references at the end for full details.

SIFT (Scale Invariant Feature Transform) uses Difference of Gaussian filtering at multiple scales to find scale-space extrema; keypoint locations are refined by fitting to the local samples. Low-contrast points and unstable points along edges are rejected. A descriptor is computed from histograms of local gradient orientations. SIFT was patented, but the patent expired in March 2020.

SURF (Speeded Up Robust Features) uses integral images to speed up Gaussian filtering and a Hessian matrix approximation. Features are computed at multiple scales by changing the size of the filter rather than downsampling the image. A descriptor is computed using Haar wavelet responses. SURF is patented, with the patent expiring in May 2027.

ORB (Oriented FAST and Rotated BRIEF) uses the FAST feature detector, which examines a circle of 16 pixels around a candidate centre and requires a contiguous arc of at least 9 of them to be consistently brighter or darker than the centre. Features are detected on a multiscale pyramid of the image. BRIEF descriptors are constructed from binary intensity comparisons between pairs of pixels in the patch around a feature.
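The binary comparisons behind BRIEF can be sketched directly (a toy version with a hypothetical random sampling pattern; the real descriptor uses 256 pairs drawn from a designed pattern over a smoothed patch):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sampling pattern: 128 random point pairs inside a 31x31 patch,
# each row holding (y1, x1, y2, x2).
pairs = rng.integers(0, 31, size=(128, 4))

def brief_descriptor(patch, pairs):
    """Each bit is 1 if the first sampled pixel is darker than the second."""
    return np.array([patch[y1, x1] < patch[y2, x2]
                     for y1, x1, y2, x2 in pairs], dtype=np.uint8)

def hamming(d1, d2):
    """Descriptor distance = number of differing bits."""
    return int(np.count_nonzero(d1 != d2))

patch = rng.random((31, 31))
d1 = brief_descriptor(patch, pairs)
d2 = brief_descriptor(patch + 0.5, pairs)  # brightness shift: bits unchanged
```

Because each bit depends only on the ordering of two intensities, the descriptor is unchanged by any monotone brightness change, and descriptors are compared with the Hamming distance, which is very fast on binary strings.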

SuperPoint is a deep method for finding feature points. It is pretrained on a synthetic dataset of simple shapes with distinct corner points, then trained on images warped by homographies to simulate different views of the same features. LightGlue is a transformer-based method for matching points in which self- and cross-attention add context to the positional encoding to drive a confidence classifier. Points that are confidently not matches are pruned before the next layer.

Pros

Cons

Implementation Notes

  1. Convert image to luminance/grey
  2. Detect interest points
  3. Sort interest points by response to choose best keypoints
  4. Determine orientation of keypoints
  5. Compute descriptors for keypoints using their orientation

  Code on github

Example

Detect and match features between 2 images:

detect-match -c current_frame.png -p previous_frame.png -k keypoints_image.png -m matches_image.png -n 2000 -a sift

The algorithm parameter -a can be set to sift, surf or orb. Note: SURF is only available if you have compiled OpenCV with the non-free components enabled.

The workfeatures.py script will extract frames from a video and run detect-match on consecutive frames.

To use SuperPoint and LightGlue, you can use my techdemo docker image, but a few additional set-up steps are required.

The splg.py script can be used to detect and match features in all consecutive frames in a directory of images:

splg.py --images_dir extracted_frames --output_dir results

where extracted_frames is a directory of images named in the format frame_%05d.png and results is the directory to which images of the keypoints and matches will be written.

In the following videos, the methods are laid out top left = SIFT, top right = SURF, bottom left = ORB, bottom right = SuperPoint/LightGlue.

Comparison of Detected Keypoints

Points that are detected consistently on the same objects over multiple frames are preferred. Note how SuperPoint finds many features on the surface of the water and in the sky. The border on the SuperPoint image is due to the input being scaled for inference.

Comparison of Detected Matches: Each Quadrant Shows a Pair of Consecutive Frames

Matches that are stable across many frames are better than matches that are detected randomly or have short duration. The LightGlue matches from points on the water surface appear to be correct.

Algorithm   Time (microseconds)   Matches
SIFT        512396                2000
SURF        246509                1994
ORB         42961                 2000
SP & LG     71331 *               1603

* Python implementation; timing not strictly comparable.

The table above shows a comparison of the methods for a typical image pair. Each algorithm was set to retain the best 2000 detected keypoints.

Summary


Feature Tracking

Introduction

Features can be tracked by detecting interest points in every frame of a video and finding matches between each pair of consecutive frames. Matching requires descriptors to be computed and each point's descriptor compared against its candidate matches. Methods that explicitly track features avoid computing descriptors and matching because, at video frame rates, most features move only a small amount between frames. Furthermore, if the global motion is slow, most feature tracks remain within the frame and the cost of detecting new features in every frame is avoided.

The KLT (Kanade–Lucas–Tomasi) tracking method is a classic computer vision method based on interest points obtained from Good Features to Track. Given a start position, an iterative search process is applied to match a patch centred at the start position in the first image to a search position in the second image.

CoTracker is a deep method for tracking using a transformer-based model. Convolutional features are extracted and tracked in small overlapping windows. Cross-track attention plus extra support points make the method more robust to noise and occlusions: features tend to lie either on the same object or on the background, and attention captures this relationship. In CoTracker 3, existing trackers were used as teachers to generate labelled training data, and the model was pre-trained using synthetic data.

Pros

Cons

Implementation Notes

For the KLT method,

  1. Open video
  2. Read a frame and convert to grey/luminance
  3. Detect initial set of features to track
  4. Repeat:
    1. Read a new frame and convert to grey/luminance
    2. Iterative search to match the features from the previous frame to the new one
    3. Update each feature position
    4. Some features will not be trackable (they go off frame, are occluded, etc.), so if the number of features falls below a threshold, detect additional features in the new frame

  Code on github

The CoTracker model is available on Torch Hub; we only have to pass it the frames, read with a Python imageio object.

Example

To run KLT tracking, the demo can be built in the techdemo docker container (simply run make). To track a video,

klt-tracker -i video.mp4 -o tracks.mp4 -n 2000

KLT Tracking Result

CoTracker can be run in the techdemo docker container, but some set up steps are required (see the README). To track a video,

python3 online_demo.py --video_path video.mp4 --grid_size 20

CoTracker Result

Summary


References