Journal of Systems Engineering and Electronics ›› 2026, Vol. 37 ›› Issue (1): 36-44.doi: 10.23919/JSEE.2026.000024

• PERCEPTION, CONTROL, AND DECISION-MAKING OF EMBODIED INTELLIGENT SYSTEMS •

A lightweight pure visual BEV perception method based on dual distillation of spatial-temporal knowledge

Bingdong LIU, Ruihang YU, Zhiming XIONG, Meiping WU

  • Received:2025-12-03 Online:2026-02-18 Published:2026-03-09
  • Contact: Ruihang YU E-mail: 2546104693@qq.com; yuruihang@nudt.edu.cn; scottxzm@163.com; meipingwu@263.com
  • About author:
    LIU Bingdong was born in 2002. He is currently pursuing his M.S. degree at the National University of Defense Technology. His research interests are environmental perception and target detection. E-mail: 2546104693@qq.com

    YU Ruihang was born in 1988. He received his Ph.D. degree from the National University of Defense Technology, where he is an associate professor. His research interests include Global Navigation Satellite System/Inertial Navigation System integrated navigation systems and gravimetry. E-mail: yuruihang@nudt.edu.cn

    XIONG Zhiming was born in 1991. He received his Ph.D. degree from the National University of Defense Technology, where he is an assistant researcher. His research interests include underwater integrated navigation systems and gravimetry. E-mail: scottxzm@163.com

    WU Meiping was born in 1970. He received his Ph.D. degree from the National University of Defense Technology, where he is a professor. His research interests include navigation technology, gravity measurement, and global positioning system navigation technology. E-mail: meipingwu@263.com
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (42476084; 62203456; 42276199), the Stable Support Project of National Key Laboratory (WDZC 20245250302), and the National Key R&D Program of China (2024YFC2813502; 2024YFC2813302).

Abstract:

Bird’s-eye-view (BEV) perception is a core technology for autonomous driving systems. However, existing solutions face a dilemma: multi-modal methods are costly, while vision-only approaches offer limited performance. To address this issue, this paper proposes a lightweight pure visual BEV perception framework based on dual distillation of spatial-temporal knowledge. The framework designs a lightweight vision-only student model based on ResNet, which leverages a dual distillation mechanism to learn from powerful teacher models that integrate temporal information with both image and light detection and ranging (LiDAR) modalities. Specifically, we distill efficient multi-modal feature extraction and spatial fusion capabilities from the BEVFusion model, and distill advanced temporal information fusion and spatiotemporal attention mechanisms from the BEVFormer model. This dual distillation strategy enables the student model to achieve perception performance close to that of multi-modal models without relying on LiDAR. Experimental results on the nuScenes dataset demonstrate that the proposed model significantly outperforms classical vision-only algorithms, achieves performance comparable to state-of-the-art vision-only methods on the nuScenes detection leaderboard in terms of both mean average precision (mAP) and the nuScenes detection score (NDS), and exhibits notable advantages in inference efficiency. Although the proposed dual-teacher paradigm incurs higher offline training costs than single-model approaches, it yields a streamlined, highly efficient student model suitable for resource-constrained real-time deployment, providing an effective pathway toward low-cost, high-performance autonomous driving perception systems.
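The dual distillation strategy in the abstract can be illustrated with a minimal sketch, assuming (as the abstract does not specify the loss design) that both teachers supervise the student's BEV feature map through simple feature-mimicking mean-squared-error terms: one toward a BEVFusion-like spatial teacher and one toward a BEVFormer-like temporal teacher, combined with tunable weights. All names and weights here are illustrative, not taken from the paper.

```python
import numpy as np

def dual_distillation_loss(student_feat, spatial_teacher_feat, temporal_teacher_feat,
                           w_spatial=0.5, w_temporal=0.5):
    """Combine two feature-mimicking MSE terms on BEV feature maps:
    one toward the multi-modal spatial teacher (BEVFusion-like) and
    one toward the temporal teacher (BEVFormer-like).
    This is an illustrative sketch, not the paper's actual loss."""
    l_spatial = np.mean((student_feat - spatial_teacher_feat) ** 2)
    l_temporal = np.mean((student_feat - temporal_teacher_feat) ** 2)
    return w_spatial * l_spatial + w_temporal * l_temporal

# Toy BEV feature maps of shape (channels, height, width)
rng = np.random.default_rng(0)
s = rng.normal(size=(4, 8, 8))            # student features
t_spatial = s + 0.1                       # spatial teacher (offset by 0.1)
t_temporal = s - 0.1                      # temporal teacher (offset by 0.1)

loss = dual_distillation_loss(s, t_spatial, t_temporal)
# Each MSE term is 0.1**2 = 0.01, so the weighted sum is 0.01
```

In practice, such mimicking terms are typically added to the student's own detection loss, so the student learns the task while being pulled toward both teachers' BEV representations.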

Key words: 3D object detection, bird’s-eye-view (BEV), knowledge distillation, multimodal fusion, lightweight model