Learning Where to Focus for Efficient Video Object Detection
Zhengkai Jiang1,2, Yu Liu3, Ceyuan Yang3, Jihao Liu3, Peng Gao3,
Qian Zhang4, Shiming Xiang1,2, Chunhong Pan1,2
1National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 2School of Artificial Intelligence, University of Chinese Academy of Sciences,
3The Chinese University of Hong Kong, 4Horizon Robotics
Overview
Previous approaches propagate and aggregate features across video frames using optical-flow warping. However, directly applying image-level optical flow to high-level features may not establish accurate spatial correspondences. We therefore propose a novel module, Learnable Spatio-Temporal Sampling (LSTS), which learns accurate semantic-level correspondences between frame features. The sampling locations are first randomly initialized and then updated iteratively, guided by the detection supervision, to progressively find better spatial correspondences. In addition, a Sparsely Recursive Feature Updating (SRFU) module and a Dense Feature Aggregation (DFA) module are introduced to model temporal relations and to enhance per-frame features, respectively. The proposed method achieves state-of-the-art performance on the ImageNet VID dataset with lower computational cost at real-time speed.
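The core idea of LSTS can be illustrated with a small sketch: for each location in the current frame's feature map, sample the previous (key) frame's features at that location plus a set of learnable offsets, weight the samples by their similarity to the current feature, and aggregate. The snippet below is a minimal NumPy illustration of this sampling-and-aggregation step under simplified assumptions (fixed offsets standing in for the learned sampling locations, scaled dot-product similarity); the function names and shapes are our own, not the authors' implementation.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample a feature map feat of shape (C, H, W) at a
    fractional location (y, x), clamping to the map borders."""
    C, H, W = feat.shape
    y = np.clip(y, 0, H - 1)
    x = np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[:, y0, x0]
            + (1 - wy) * wx * feat[:, y0, x1]
            + wy * (1 - wx) * feat[:, y1, x0]
            + wy * wx * feat[:, y1, x1])

def lsts_aggregate(f_key, f_cur, offsets):
    """For each location p in the current feature map f_cur, sample the
    key-frame features f_key at p + offset for every (dy, dx) offset,
    weight the samples by softmax similarity with f_cur at p, and
    return the aggregated feature map (same shape as f_cur)."""
    C, H, W = f_cur.shape
    out = np.zeros_like(f_cur)
    for i in range(H):
        for j in range(W):
            q = f_cur[:, i, j]
            samples = np.stack([bilinear_sample(f_key, i + dy, j + dx)
                                for dy, dx in offsets])   # (N, C)
            sim = samples @ q / np.sqrt(C)                # scaled dot product
            w = np.exp(sim - sim.max())
            w /= w.sum()                                  # softmax weights
            out[:, i, j] = w @ samples                    # weighted aggregation
    return out
```

In the full method these offsets are parameters: they receive gradients through the bilinear sampling and are updated by the detection loss, which is how the sampling locations "learn where to focus".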
Results
  • Quantitative Results
  • Our LSTS achieves 77.2% mAP on ImageNet VID, the mainstream benchmark for video object detection, outperforming other state-of-the-art methods when both accuracy and efficiency are considered. More detailed comparisons and ablation studies are presented in our paper.

  • Visualization of the Statistical Distribution
  • Figures (a) and (b) show that the distribution of the learned sampling locations closely matches the distribution computed from the dataset.

    Bibtex
    @inproceedings{jiang2020learning,
      title     = {Learning Where to Focus for Efficient Video Object Detection},
      author    = {Jiang, Zhengkai and Liu, Yu and Yang, Ceyuan and Liu, Jihao and Gao, Peng and Zhang, Qian and Xiang, Shiming and Pan, Chunhong},
      booktitle = {European Conference on Computer Vision (ECCV)},
      year      = {2020}
    }