INFER: INtermediate representations for FuturE pRediction

Shashank Srikanth, Junaid Ahmed Ansari, R. Karnik Ram, Sarthak Sharma, J. Krishna Murthy, and K. Madhava Krishna


Future prediction. Left (top): an interesting scenario from KITTI (using Lidar and stereo) where the car (cyan bounding box) will turn left over the next 4 seconds. Using INFER, we predict this future trajectory given the past trajectory (the trail behind the car). The predicted future trajectory (orange) and the ground-truth trajectory (blue) are projected onto the image. On the top right and bottom right, we show zero-shot transfer results on Cityscapes (using stereo) and Oxford RobotCar (using Lidar), which demonstrate cross-sensor transferability and generalization to different driving scenarios. For visualization, we register the predicted and ground-truth trajectories in 3D for each dataset (shown below the image): the green 3D bounding box depicts the first sighting of the vehicle of interest, which is also when we start preconditioning, and the red 3D bounding box indicates the start of prediction. We also register the Lidar/depth information (cyan: road, dark gray: lane, magenta: obstacles) to demonstrate the accuracy of our prediction.



The arXiv version of the paper can be found here.

Code can be found here.

The datasets can be found here.


Abstract

Deep learning methods have ushered in a new era for computer vision and robotics. With very accurate methods for object detection and semantic segmentation, we are now at a juncture where we can envisage the application of these techniques to perform higher-order understanding. One such application, which we consider in this work, is predicting future states of traffic participants in urban driving scenarios. Specifically, we argue that by constructing intermediate representations of the world using off-the-shelf computer vision models for semantic segmentation and object detection, we can train models that account for the multi-modality of future states and, at the same time, transfer well across different train and test distributions (datasets). Our approach, dubbed INFER (INtermediate representations for distant FuturE pRediction), involves training an autoregressive model that takes in an intermediate representation of past states of the world and predicts a multimodal distribution over plausible future states. The model consists of an Encoder-Decoder with ConvLSTMs along the skip connections and between the Encoder and Decoder. The network takes an intermediate representation of the scene and predicts the future locations of the Vehicle of Interest (VoI). We outperform the current best future prediction model on KITTI by a significant margin when predicting deep into the future (3 s, 4 s). Contrary to most approaches dealing with future prediction, which do not generalize well to datasets they have not been trained on, we test our method on different datasets such as Oxford RobotCar and Cityscapes, and show that the network performs well across these datasets, which differ in scene layout, weather conditions, and sensor modality. We carry out a thorough ablation study on our intermediate representation that captures the role played by different semantics. We conclude the results section by showcasing an important use case of future prediction: multi-object tracking, and exhibit results on select sequences from KITTI and Cityscapes.
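To make the architecture description above concrete, the sketch below shows what an Encoder-Decoder with ConvLSTMs along the skip connections and at the bottleneck might look like in PyTorch. It is a minimal illustration under assumed channel counts and an assumed five-channel intermediate representation (e.g. road, lane, obstacles, other vehicles, and the target vehicle); it is not the released implementation, whose exact layer configuration is in the code linked above.

```python
# A minimal PyTorch sketch (not the released implementation) of an
# Encoder-Decoder with ConvLSTMs on the skip connection and at the
# bottleneck, in the spirit of INFER-Skip. Channel counts and the
# 5-channel intermediate representation are illustrative assumptions.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Basic ConvLSTM cell: all four gates come from one convolution."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

    def init_state(self, ref, height, width):
        # Zero-initialised hidden and cell states matching ref's batch/dtype/device.
        zeros = ref.new_zeros(ref.shape[0], self.hid_ch, height, width)
        return zeros, zeros.clone()


class INFERSkipSketch(nn.Module):
    """Encoder -> ConvLSTM bottleneck -> decoder, with a ConvLSTM skip link.
    Each input frame is a multi-channel bird's-eye intermediate representation;
    each output frame is a single-channel likelihood map of the VoI location."""

    def __init__(self, in_ch=5, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(base, 2 * base, 3, padding=1), nn.ReLU())
        self.skip_lstm = ConvLSTMCell(base, base)                # along the skip connection
        self.bottleneck_lstm = ConvLSTMCell(2 * base, 2 * base)  # between encoder and decoder
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec = nn.Sequential(nn.Conv2d(3 * base, base, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(base, 1, 3, padding=1))

    def forward(self, frames):
        # frames: (T, B, C, H, W) sequence of intermediate representations.
        T, B, C, H, W = frames.shape
        skip_state = self.skip_lstm.init_state(frames[0], H, W)
        bott_state = self.bottleneck_lstm.init_state(frames[0], H // 2, W // 2)
        outputs = []
        for t in range(T):
            e1 = self.enc1(frames[t])                  # full-resolution features
            e2 = self.enc2(self.pool(e1))              # half-resolution features
            skip_state = self.skip_lstm(e1, skip_state)
            bott_state = self.bottleneck_lstm(e2, bott_state)
            d = torch.cat([self.up(bott_state[0]), skip_state[0]], dim=1)
            outputs.append(self.dec(d))                # likelihood map for this timestep
        return torch.stack(outputs)                    # (T, B, 1, H, W)


if __name__ == "__main__":
    model = INFERSkipSketch()
    past = torch.randn(6, 1, 5, 64, 64)                # 6 past frames on a 64x64 grid
    print(model(past).shape)                           # torch.Size([6, 1, 1, 64, 64])
```

When run autoregressively, the likelihood map predicted at one timestep is fed back into the intermediate representation for the next timestep, which is how predictions are rolled out beyond the preconditioning window.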


Qualitative Results

Sample video sequence from the KITTI Tracking dataset

Sample transfer to the Cityscapes dataset

Sample transfer to the Oxford RobotCar dataset

Trajectories plotted on the occupancy grids for the KITTI dataset


Some more qualitative plots




Results

We perform 5-fold cross-validation while training and evaluating on KITTI; the error for each split is given in the tables below. We report the pixel loss (L2 norm); to obtain the error in metres, multiply the values by 0.25 (the grid resolution in metres per pixel). The average loss for a given model is computed as the weighted mean of the losses of the individual splits, with the number of sequences in each split used as the weights. A worked example of this computation follows the tables.

INFER-Skip results with all channels (pixel loss)

          1s      2s      3s      4s      Num. sequences
Split 0   2.2     2.91    3.68    5.29    16
Split 1   1.97    2.72    2.75    2.75    19
Split 2   2.54    3.3     4.15    5.74    28
Split 3   2.32    3.16    3.86    4.82    22
Split 4   2.06    2.9     4.13    5.79    16

INFER results with all channels (pixel loss)

          1s      2s      3s      4s      Num. sequences
Split 0   2.82    3.8     5.56    7.9     16
Split 1   2.57    3.55    4.31    4.76    19
Split 2   2.4     3.98    5.64    7.75    28
Split 3   2.99    3.94    4.49    5.43    22
Split 4   1.44    1.7     2.65    4.19    16

Baseline (pixel loss)

          1s      2s      3s      4s      Num. sequences
Split 0   3.17    5.04    6.47    8.16    16
Split 1   1.66    2.80    3.71    4.55    19
Split 2   4.07    7.45    9.68    11.13   28
Split 3   3.50    5.08    6.29    7.67    22
Split 4   2.32    2.75    4.16    6.15    16

INFER w/o road (pixel loss)

          1s      2s      3s      4s      Num. sequences
Split 0   3.08    4.68    6.03    8.23    16
Split 1   3.76    6.08    9.73    13.83   19
Split 2   2.65    4.76    6.94    9.21    28
Split 3   2.32    4.16    6.49    9.18    22
Split 4   2.53    4.58    6.84    9.61    16

INFER w/o lane (pixel loss)

          1s      2s      3s      4s      Num. sequences
Split 0   2.34    3.42    4.85    7.32    16
Split 1   1.73    2.42    2.81    3.23    19
Split 2   2.97    3.76    4.46    5.5     28
Split 3   2.66    3.67    4.27    5.05    22
Split 4   1.29    1.4     2.01    3.06    16

INFER w/o obstacles (pixel loss)

          1s      2s      3s      4s      Num. sequences
Split 0   2.08    2.85    3.55    4.48    16
Split 1   2.42    3.25    3.38    3.23    19
Split 2   2.20    3.47    4.56    5.56    28
Split 3   2.65    3.84    4.75    6.03    22
Split 4   1.39    2.23    3.42    5.17    16