Earlier this year, it was my pleasure to host a series of webinar meetup sessions attended by hundreds of industry professionals. The topic was Object Tracking Technology, a critical aspect for creating safe Autonomous Vehicles (AVs).

The first webinar looked at the classic model-based object tracking techniques, while the second focused on a newer method: learning-based object tracking. Below are brief summaries of both sessions.

Session 1: Model-Based Object Tracking (April 2020)

In model-based tracking, the goal is to find the current state xk of a given tracked object at time k. This state may include the object's location, velocity, etc., and should be estimated from the input measurements zk of the tracked object. Examples of such measurements are the current car velocity provided by the car signals, or the location of another car estimated by an object detection system.

A common assumption in model-based tracking is the Markovian assumption, i.e., that xk depends only on the previous state x(k-1) and that zk can be generated using only the current state xk. This is characterized by the functions xk = fk(x(k-1), v(k-1)) and zk = hk(xk, nk), where v(k-1) and nk are noise vectors, which are assumed to be i.i.d. The functions fk and hk are assumed to be known (and if not, one may learn them).
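To make the two functions concrete, here is a minimal sketch of one possible choice of fk and hk. It assumes a one-dimensional constant-velocity motion model with a position-only measurement; the model, the noise levels, and the names f, h and simulate are illustrative assumptions, not something presented in the webinar.

```python
import random

def f(x_prev, v_prev, dt=1.0):
    """State transition (assumed model): constant velocity plus process noise."""
    pos, vel = x_prev
    return (pos + dt * vel + v_prev[0], vel + v_prev[1])

def h(x, n):
    """Measurement function (assumed model): position plus measurement noise."""
    return x[0] + n

def simulate(x0, steps, process_std=0.05, meas_std=0.5, seed=0):
    """Generate states x_1..x_K and measurements z_1..z_K with i.i.d. Gaussian noise."""
    rng = random.Random(seed)
    x, states, measurements = x0, [], []
    for _ in range(steps):
        v = (rng.gauss(0.0, process_std), rng.gauss(0.0, process_std))
        x = f(x, v)                      # x_k depends only on x_(k-1)
        states.append(x)
        measurements.append(h(x, rng.gauss(0.0, meas_std)))  # z_k depends only on x_k
    return states, measurements

if __name__ == "__main__":
    states, measurements = simulate((0.0, 1.0), steps=5)
    for x, z in zip(states, measurements):
        print(f"state={x}, measurement={z:.2f}")
```

A tracker only ever sees the measurements; the states are the hidden quantities it has to estimate.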

With the above assumptions, one may estimate xk from x(k-1) and zk simply by using Bayes' rule. In the case that fk and hk are linear and the noise vectors are i.i.d. Gaussian, this leads us to the popular Kalman Filter, which is simple to use and computationally efficient. If fk and hk are non-linear but "smooth enough" (i.e., do not change rapidly), a linearization of them may be used, which leads to the Extended Kalman Filter. In the more general case, one needs a more sophisticated approximation, in which the distribution of the state is represented at each measurement by Monte Carlo samples, which leads to the Particle Filter. This filter is more accurate in these cases but is computationally demanding.
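For the linear Gaussian case, one predict/update cycle of a Kalman filter can be sketched as follows. This is a minimal illustration under assumed settings: a constant-velocity state (position, velocity), a position-only measurement, and hand-picked noise variances q and r. The function names kalman_step and track are hypothetical, and the 2x2 matrix algebra is written out by hand to keep the sketch free of dependencies.

```python
def kalman_step(state, cov, z, dt=1.0, q=0.01, r=1.0):
    """One predict/update cycle.

    state: (position, velocity); cov: 2x2 covariance as nested tuples;
    z: measured position; q, r: assumed process/measurement noise variances.
    """
    p, v = state
    (p00, p01), (p10, p11) = cov

    # Predict: x_pred = F x and P_pred = F P F^T + Q, with F = [[1, dt], [0, 1]]
    # and Q = diag(q, q).
    p_pred = p + dt * v
    v_pred = v
    a00 = p00 + dt * (p01 + p10) + dt * dt * p11 + q
    a01 = p01 + dt * p11
    a10 = p10 + dt * p11
    a11 = p11 + q

    # Update with H = [1, 0]: the innovation covariance S is a scalar.
    innovation = z - p_pred
    s = a00 + r
    k0, k1 = a00 / s, a10 / s            # Kalman gain K = P_pred H^T / S
    new_state = (p_pred + k0 * innovation, v_pred + k1 * innovation)
    # P = (I - K H) P_pred
    new_cov = (((1 - k0) * a00, (1 - k0) * a01),
               (a10 - k1 * a00, a11 - k1 * a01))
    return new_state, new_cov

def track(measurements, init_state=(0.0, 0.0), init_var=10.0):
    """Run the filter over a sequence of position measurements."""
    state, cov = init_state, ((init_var, 0.0), (0.0, init_var))
    for z in measurements:
        state, cov = kalman_step(state, cov, z)
    return state, cov

if __name__ == "__main__":
    # Object moving at one unit per step, observed without noise for clarity.
    state, _ = track([float(k) for k in range(1, 41)])
    print(state)
```

The filter starts with no knowledge of the velocity and recovers it from position measurements alone, which is the practical appeal of the approach.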

Session 2: Learning-Based Object Tracking (May 2020)

In Learning-Based Object Tracking, there are various strategies. Some of them simply try to estimate the functions fk and hk from the data and then apply the model-based techniques. Others aim at estimating the current state directly from the previous one and the current measurements. Various other frameworks have been used for this task, including reinforcement learning, recurrent neural networks, and the framework we'll dive into here: the match-filter based solution.

For an object to be tracked, the match-filter (or correlation) based solution calculates a representation of both the object and the target frame in which the search is performed, and then finds the new location of the object as the location with the largest correlation between the two. Learning such a representation is challenging, since one may encounter similar objects, occlusions, objects exiting the scene, blurring, and other problems.
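The search step itself can be sketched in a few lines. Note the simplification: the sketch below correlates raw pixel values, whereas the learning-based trackers discussed here correlate learned representations. The zero-mean normalized cross-correlation score and the names ncc and locate are assumptions chosen for illustration.

```python
import math

def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equally sized 2D lists."""
    a = [x for row in patch for x in row]
    b = [x for row in template for x in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0   # flat patches carry no information

def locate(frame, template):
    """Return ((row, col), score) of the best-matching top-left position."""
    th, tw = len(template), len(template[0])
    best_score, best_pos = -2.0, (0, 0)     # NCC scores lie in [-1, 1]
    for r in range(len(frame) - th + 1):
        for c in range(len(frame[0]) - tw + 1):
            patch = [row[c:c + tw] for row in frame[r:r + th]]
            score = ncc(patch, template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

if __name__ == "__main__":
    template = [[1, 2], [3, 4]]
    frame = [[0] * 6 for _ in range(6)]
    frame[3][2:4] = [1, 2]                  # place the object at row 3, col 2
    frame[4][2:4] = [3, 4]
    print(locate(frame, template))
```

With raw pixels, a similar-looking object elsewhere in the frame would score just as highly, which is one reason a learned, more discriminative representation is needed.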

In Single Object Tracking, an AV's perception software is provided with initial details of a given target object of an arbitrary class. These details may be just a template (an image) of the target, and the goal is to track it. The current leading approaches are the Discriminative Correlation Filter (DCF) and Siamese networks. In DCF, the parameters of the network that calculates the representation are learned online. Siamese networks, in contrast, require training data, a need that may be mitigated using a cycle consistency loss. Several techniques for improving the tracking have recently been proposed, such as using a localization loss, meta learning, probabilistic regression, and multi-task learning.

In Multi-Object Tracking (MOT), objects belonging to a predefined set of classes need to be identified and tracked in a given scene. This includes the re-identification of objects that leave and re-enter the scene. The leading strategy in MOT is tracking by detection, where the objects in the scene are first detected and an association stage is then performed to match objects across frames and to add and remove tracked identities. Recently, it has been shown that, by learning the detection and association filters jointly, it is possible both to improve the tracking accuracy and to accelerate the runtime. In these approaches, a metric learning loss is added to existing object detection frameworks, and the features they extract are then used to learn discriminative representations of the detected objects. This allows matches across frames to be performed efficiently.
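The association stage of tracking by detection can be illustrated with a small sketch. It makes several simplifying assumptions: boxes are matched by intersection-over-union (IoU) rather than by learned features, matching is greedy rather than an optimal assignment, and a track is dropped as soon as it has no match, so there is no re-identification. The function names and the min_iou threshold are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, next_id, min_iou=0.3):
    """Match existing tracks to new detections.

    tracks: dict mapping track id -> last known box.
    Returns (updated tracks, next unused id).
    """
    # Score every (track, detection) pair and visit the best pairs first.
    pairs = sorted(((iou(box, det), tid, j)
                    for tid, box in tracks.items()
                    for j, det in enumerate(detections)), reverse=True)
    updated, used = {}, set()
    for score, tid, j in pairs:
        if score < min_iou:
            break                         # remaining pairs overlap too little
        if tid in updated or j in used:
            continue                      # track or detection already matched
        updated[tid] = detections[j]
        used.add(j)
    # Unmatched detections start new identities; unmatched tracks are removed.
    for j, det in enumerate(detections):
        if j not in used:
            updated[next_id] = det
            next_id += 1
    return updated, next_id

if __name__ == "__main__":
    tracks = {0: (0, 0, 10, 10), 1: (20, 20, 30, 30)}
    detections = [(21, 21, 31, 31), (1, 0, 11, 10), (50, 50, 60, 60)]
    print(associate(tracks, detections, next_id=2))
```

Replacing the IoU score with a distance between learned appearance features is what lets a tracker keep identities apart when boxes overlap or objects reappear.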

In more recent works, it has been shown that MOT performance can be improved further if the association/re-identification loss is also integrated into the training of the neural network weights, e.g., by using a neural solver or an approximated loss function.

Conclusion

I hope these summaries are informative and helpful to anyone who missed the webinar sessions. With advanced groundwork in place, object tracking technology is ripe for continued testing and innovation. Its advancement will be critical to a safe AV future.