Journal of Systems Engineering and Electronics ›› 2026, Vol. 37 ›› Issue (1): 36-44.doi: 10.23919/JSEE.2026.000024

• PERCEPTION, CONTROL, AND DECISION-MAKING OF EMBODIED INTELLIGENT SYSTEMS •

A lightweight pure visual BEV perception method based on dual distillation of spatial-temporal knowledge

Bingdong LIU, Ruihang YU, Zhiming XIONG, Meiping WU

  • Received:2025-12-03 Online:2026-02-18 Published:2026-03-09
  • Contact: Ruihang YU E-mail: 2546104693@qq.com; yuruihang@nudt.edu.cn; scottxzm@163.com; meipingwu@263.com
  • About author:
    LIU Bingdong was born in 2002. He is currently pursuing his M.S. degree at the National University of Defense Technology. His research interests are environmental perception and target detection. E-mail: 2546104693@qq.com

    YU Ruihang was born in 1988. He received his Ph.D. degree from the National University of Defense Technology, where he is an associate professor. His research interests include Global Navigation Satellite System/Inertial Navigation System integrated navigation systems and gravimetry. E-mail: yuruihang@nudt.edu.cn

    XIONG Zhiming was born in 1991. He received his Ph.D. degree from the National University of Defense Technology, where he is an assistant researcher. His research interests include underwater integrated navigation systems and gravimetry. E-mail: scottxzm@163.com

    WU Meiping was born in 1970. He received his Ph.D. degree from the National University of Defense Technology, where he is a professor. His research interests include navigation technology, gravity measurement, and global positioning system navigation technology. E-mail: meipingwu@263.com
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (42476084; 62203456; 42276199), the Stable Support Project of National Key Laboratory (WDZC 20245250302), and the National Key R&D Program of China (2024YFC2813502; 2024YFC2813302).

Abstract:

Bird’s-eye-view (BEV) perception is a core technology for autonomous driving systems. However, existing solutions face a dilemma: multi-modal methods are costly, while vision-only approaches offer limited performance. To address this issue, this paper proposes a lightweight pure visual BEV perception framework based on dual distillation of spatial-temporal knowledge. The framework designs a lightweight vision-only student model based on ResNet, which leverages a dual distillation mechanism to learn from powerful teacher models that integrate temporal information with both image and light detection and ranging (LiDAR) modalities. Specifically, we distill efficient multi-modal feature extraction and spatial fusion capabilities from the BEVFusion model, and distill advanced temporal information fusion and spatiotemporal attention mechanisms from the BEVFormer model. This dual distillation strategy enables the student model to achieve perception performance close to that of multi-modal models without relying on LiDAR. Experimental results on the nuScenes dataset demonstrate that the proposed model significantly outperforms classical vision-only algorithms, achieves performance comparable to state-of-the-art vision-only methods on the nuScenes detection leaderboard in terms of both mean average precision (mAP) and the nuScenes detection score (NDS), and exhibits notable advantages in inference efficiency. Although the proposed dual-teacher paradigm incurs higher offline training costs than single-model approaches, it yields a streamlined, highly efficient student model suitable for resource-constrained real-time deployment, providing an effective pathway toward low-cost, high-performance autonomous driving perception systems.
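The dual distillation strategy in the abstract can be illustrated with a minimal sketch, assuming (as the abstract does not specify the loss design) that both teachers supervise the student's BEV feature map through simple feature-mimicking mean-squared-error terms: one toward a BEVFusion-like spatial teacher and one toward a BEVFormer-like temporal teacher, combined with tunable weights. All names and weights here are illustrative, not taken from the paper.

```python
import numpy as np

def dual_distillation_loss(student_feat, spatial_teacher_feat, temporal_teacher_feat,
                           w_spatial=0.5, w_temporal=0.5):
    """Combine two feature-mimicking MSE terms on BEV feature maps:
    one toward the multi-modal spatial teacher (BEVFusion-like) and
    one toward the temporal teacher (BEVFormer-like).
    This is an illustrative sketch, not the paper's actual loss."""
    l_spatial = np.mean((student_feat - spatial_teacher_feat) ** 2)
    l_temporal = np.mean((student_feat - temporal_teacher_feat) ** 2)
    return w_spatial * l_spatial + w_temporal * l_temporal

# Toy BEV feature maps of shape (channels, height, width)
rng = np.random.default_rng(0)
s = rng.normal(size=(4, 8, 8))            # student features
t_spatial = s + 0.1                       # spatial teacher (offset by 0.1)
t_temporal = s - 0.1                      # temporal teacher (offset by 0.1)

loss = dual_distillation_loss(s, t_spatial, t_temporal)
# Each MSE term is 0.1**2 = 0.01, so the weighted sum is 0.01
```

In practice, such mimicking terms are typically added to the student's own detection loss, so the student learns the task while being pulled toward both teachers' BEV representations.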

Key words: 3D object detection, bird’s-eye-view (BEV), knowledge distillation, multimodal fusion, lightweight model