Future prediction. Left (top): a scenario from KITTI (using Lidar and stereo)
where the car (cyan bounding box) turns left over the next 4 seconds. Given the past
trajectory (the trail behind the car), INFER predicts this future trajectory. The predicted future trajectory (orange) and the ground-truth trajectory (blue) are projected onto the
image. Top-right and bottom-right: zero-shot transfer results on Cityscapes (using stereo) and Oxford RobotCar (using Lidar),
demonstrating transferability across sensors and across driving scenarios. For visualization, we register the predicted and ground-truth
trajectories in 3D for each dataset (shown below the image): the green 3D bounding box marks the first sighting of the vehicle of
interest, which is also when preconditioning starts, and the red 3D bounding box marks the start of prediction. We also register the
Lidar/depth information (cyan: road, dark gray: lane, magenta: road) to demonstrate the accuracy of our prediction.
The arXiv preprint of the paper can be found here.
The code can be found here.
The datasets can be found here.
Deep learning methods have ushered in a new era for computer vision and robotics. With highly accurate methods for object detection and semantic
segmentation now available, we are at a juncture where we can envisage applying these techniques to higher-order understanding. One such
application, which we consider in this work, is predicting the future states of traffic participants in urban driving scenarios. Specifically, we argue
that by constructing intermediate representations of the world using off-the-shelf computer vision models for semantic segmentation and object
detection, we can train models that account for the multimodality of future states and, at the same time, transfer well across different train and
test distributions (datasets).
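Conceptually, such an intermediate representation can be built by stacking one binary occupancy channel per semantic class on a spatial grid. The sketch below illustrates this idea only; the specific class set and IDs are illustrative assumptions, not the paper's exact representation:

```python
import numpy as np

# Hypothetical class IDs from a semantic segmentation output (illustrative only).
ROAD, LANE, VEHICLE, OBSTACLE = 0, 1, 2, 3
CLASS_IDS = [ROAD, LANE, VEHICLE, OBSTACLE]

def to_intermediate_representation(label_map: np.ndarray) -> np.ndarray:
    """Stack one binary occupancy channel per semantic class.

    label_map: (H, W) integer array of per-pixel class IDs.
    Returns a (C, H, W) float array with one channel per class.
    """
    return np.stack([(label_map == c).astype(np.float32) for c in CLASS_IDS])

# Toy 4x4 label map: mostly road on the left, lane on the right,
# and a single vehicle cell.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 2, 1, 1],
                   [0, 0, 1, 1]])
rep = to_intermediate_representation(labels)
```

Decoupling the network input from raw pixels in this way is what lets the same trained model consume grids produced from either Lidar or stereo depth.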
Our approach, dubbed INFER (INtermediate representations for distant FuturE pRediction), trains an autoregressive model that takes in an
intermediate representation of past states of the world and predicts a multimodal distribution over plausible future states. The model is an
Encoder-Decoder with ConvLSTM units along the skip connections and at the bottleneck between the encoder and the decoder. The network takes an intermediate
representation of the scene and predicts the future locations of the Vehicle of Interest (VoI). We outperform the current best future prediction
model on KITTI by a significant margin while predicting deep into the future (3 sec, 4 sec). Contrary to most future
prediction approaches, which do not generalize well to datasets they have not been trained on, we evaluate our method on Oxford RobotCar and
Cityscapes, and show that the network performs well across these datasets, which differ in scene layout and weather conditions, and also generalizes well
across sensor modalities. We carry out a thorough ablation study on our intermediate representation, capturing the role played by each
semantic channel. We conclude the results section by showcasing an important use case of future prediction: multi-object tracking, with results on select
sequences from KITTI and Cityscapes.
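The autoregressive scheme above, preconditioning on observed frames and then feeding each prediction back in as the next input, can be sketched as follows. Here `predict_next` is a hypothetical stand-in for the learned ConvLSTM network (a toy dynamics model that shifts the occupancy one cell forward), not the paper's actual model:

```python
import numpy as np

def predict_next(state: np.ndarray) -> np.ndarray:
    """Stand-in for the learned network: toy dynamics that shift the
    occupancy grid one cell forward along the last axis."""
    return np.roll(state, shift=1, axis=-1)

def rollout(past_states, horizon):
    """Precondition on observed frames, then roll out autoregressively:
    each prediction becomes the input for the next step."""
    state = None
    for s in past_states:          # preconditioning phase (observed frames)
        state = predict_next(s)
    preds = []
    for _ in range(horizon):       # prediction phase (model feeds itself)
        preds.append(state)
        state = predict_next(state)
    return preds

# A 1x1x8 grid with the vehicle at cell 0; predict 3 steps ahead.
grid = np.zeros((1, 1, 8))
grid[0, 0, 0] = 1.0
preds = rollout([grid], horizon=3)
```

Feeding predictions back in, rather than always conditioning on ground truth, is what lets the model predict deep into the future at test time.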