The brief self-attention module for lightweight convolution neural networks

doi:10.23919/JSEE.2025.000051

Journal of Systems Engineering and Electronics, 2025, Vol. 36, Issue 6: 1389-1397. doi: 10.23919/JSEE.2025.000051

Jie YAN¹, Yingmei WEI¹, Yuxiang XIE¹,*, Quanzhi GONG¹, Shiwei ZOU¹, Xidao LUAN²

¹ Laboratory for Big Data and Decision, College of Systems Engineering, National University of Defense Technology, Changsha 410000, China
² College of Computer Science and Engineering, Changsha University, Changsha 410000, China

Received: 2024-08-26; Accepted: 2024-08-26; Online: 2025-12-18; Published: 2026-01-07
Contact: Yuxiang XIE. E-mail: yanjie@nudt.edu.cn; weiyingmei@nudt.edu.cn; xyx89@163.com; charles_g27@qq.com; 1530531454@qq.com; xidaoluan@ccsu.cn
About the authors:
YAN Jie was born in 1999. She received her B.S. degree in cost engineering from Qingdao University of Technology, China, in 2016. She is pursuing her Ph.D. degree with the College of Systems Engineering, National University of Defense Technology, China. Her research interests include multi-modal semantic understanding, image captioning, and image classification. E-mail: yanjie@nudt.edu.cn

WEI Yingmei was born in 1972. She received her Ph.D. degree in computer science and technology from National University of Defense Technology, China, in 2000, where she is a professor with the College of Systems Engineering. Her research interests include virtual reality, information visualization, and visual analysis techniques. E-mail: weiyingmei@nudt.edu.cn

XIE Yuxiang was born in 1976. She received her B.S., M.S. and Ph.D. degrees in management science and engineering from National University of Defense Technology in 1998, 2001 and 2004 respectively, all in the College of Systems Engineering. She is a professor in the College of Systems Engineering, National University of Defense Technology. Her research interests include computer vision, image and video analysis, classification and retrieval. E-mail: yxxie@nudt.edu.cn

GONG Quanzhi was born in 1998. He received his B.S. and M.S. degrees in control science and engineering from National University of Defense Technology, China, in 2020 and 2022, respectively. His research interests include action recognition and fine-grained image classification. E-mail: charles_g27@qq.com

ZOU Shiwei was born in 2001. She received her B.S. degree in target engineering from National University of Defense Technology, China, in 2023, where she is pursuing her M.S. degree with the College of Systems Engineering. Her current research interests include multi-modal semantic understanding, remote sensing image captioning, and image classification. E-mail: zsw0915@nudt.edu.cn

LUAN Xidao was born in 1976. He received his B.S. degree in applied mathematics in 1998, and his M.S. and Ph.D. degrees in management science and engineering in 2005 and 2009, respectively, all from National University of Defense Technology. He is a professor with the College of Computer Science and Engineering, Changsha University. His research interests include computer vision, image and video analysis, classification and retrieval. E-mail: xidaoluan@ccsu.cn

Abstract:

Lightweight convolutional neural networks (CNNs) have simple structures but struggle to comprehensively and accurately extract important semantic information from images. While attention mechanisms can enhance CNNs by learning distinctive representations, most existing spatial and hybrid attention methods focus on local regions with extensive parameters, making them unsuitable for lightweight CNNs. In this paper, we propose a self-attention mechanism tailored for lightweight networks, namely the brief self-attention module (BSAM). BSAM consists of the brief spatial attention (BSA) and advanced channel attention blocks. Unlike conventional self-attention methods with many parameters, our BSA block improves the performance of lightweight networks by effectively learning global semantic representations. Moreover, BSAM can be seamlessly integrated into lightweight CNNs for end-to-end training, maintaining the network’s lightweight and mobile characteristics. We validate the effectiveness of the proposed method on image classification tasks using the Food-101, Caltech-256, and Mini-ImageNet datasets.

Key words: self-attention, lightweight neural network, deep learning
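
The abstract describes BSAM as a spatial self-attention block that learns global representations without many parameters, combined with a channel attention block. This page does not give the formulation, so the following is only a minimal sketch of that general idea and not the authors' BSA: it assumes a parameter-free dot-product attention over all spatial positions with a residual connection, followed by a simplified parameter-free channel gate standing in for the channel attention block (SE and ECA, by contrast, have learnable weights). The function names and the toy input are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_self_attention(x):
    """Parameter-free global spatial self-attention (illustrative assumption,
    not the paper's BSA).

    x: feature map flattened to C rows (channels) by N columns (positions).
    Each output position is the input plus an attention-weighted average
    over all positions, so every position sees the whole map.
    """
    c, n = len(x), len(x[0])
    scale = 1.0 / math.sqrt(c)
    out = [[0.0] * n for _ in range(c)]
    for i in range(n):
        # Scaled dot-product similarity between position i and every position j.
        scores = [scale * sum(x[k][i] * x[k][j] for k in range(c))
                  for j in range(n)]
        weights = softmax(scores)
        for k in range(c):
            context = sum(weights[j] * x[k][j] for j in range(n))
            out[k][i] = x[k][i] + context  # residual connection
    return out


def channel_gate(x):
    """Simplified channel gate (assumption): scale each channel by the
    sigmoid of its global average. No learnable weights."""
    gated = []
    for row in x:
        g = 1.0 / (1.0 + math.exp(-sum(row) / len(row)))
        gated.append([g * v for v in row])
    return gated


# Toy input: 2 channels, a 2x2 spatial grid flattened to 4 positions.
feature_map = [[1.0, 0.0, 0.5, 0.0],
               [0.0, 1.0, 0.5, 0.0]]
attended = spatial_self_attention(feature_map)
refined = channel_gate(attended)
```

Because the attention weights are computed directly from the features, this sketch adds no trainable parameters; its cost is the N×N similarity computation, which is why such blocks are usually applied to low-resolution feature maps.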

Jie YAN, Yingmei WEI, Yuxiang XIE, Quanzhi GONG, Shiwei ZOU, Xidao LUAN. The brief self-attention module for lightweight convolution neural networks[J]. Journal of Systems Engineering and Electronics, 2025, 36(6): 1389-1397.

Figures/Tables: 7

References: 38

1	HU J, SHEN L, ALBANIE S. Squeeze-and-excitation networks. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 7132–7141.
2	WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module. Proc. of the European Conference on Computer Vision, 2018: 3–19.
3	FU J, LIU J, TIAN H J, et al. Dual attention network for scene segmentation. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 3146–3154.
4	ZHANG H, GOODFELLOW I, METAXAS D, et al. Self-attention generative adversarial networks. Proc. of the 36th International Conference on Machine Learning, 2019: 7354–7363.
5	PARK J, WOO S, LEE J Y, et al. BAM: bottleneck attention module. https://arxiv.org/abs/1807.06514v1.
6	WANG Q L, WU B G, ZHU P H, et al. ECA-Net: efficient channel attention for deep convolutional neural networks. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 11531–11539.
7	SANDLER M, HOWARD A, ZHU M, et al. MobileNetV2: inverted residuals and linear bottlenecks. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 4510–4520.
8	ZHOU D Q, HOU Q B, CHEN Y P, et al. Rethinking bottleneck structure for efficient mobile network design. Proc. of the European Conference on Computer Vision, 2020: 680–697.
9	BOSSARD L, GUILLAUMIN M, VAN GOOL L. Food-101–mining discriminative components with random forests. Proc. of the European Conference on Computer Vision, 2014: 446–461.
10	GRIFFIN G, HOLUB A, PERONA P. Caltech-256 object category dataset. California Institute of Technology, 2007: 1–20.
11	VINYALS O, BLUNDELL C, LILLICRAP T, et al. Matching networks for one shot learning. Advances in Neural Information Processing Systems, 2016, 29: 8909022.
12	IANDOLA F N, HAN S, MOSKEWICZ M W, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. https://arxiv.org/abs/1602.07360v4.
13	GHOLAMI A, KWON K, WU B, et al. SqueezeNext: hardware-aware neural network design. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 1638–1647.
14	ZHANG X Y, ZHOU X Y, LIN M X, et al. ShuffleNet: an extremely efficient convolutional neural network for mobile devices. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 6848–6856.
15	HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017: 4700–4708.
16	HOWARD A G, ZHU M, CHEN B, et al. MobileNets: efficient convolutional neural networks for mobile vision applications. https://arxiv.org/abs/1704.04861.
17	HOWARD A, SANDLER M, CHU G, et al. Searching for MobileNetV3. Proc. of the IEEE/CVF International Conference on Computer Vision, 2019: 1314–1324.
18	MEHTA S, RASTEGARI M, CASPI A, et al. ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation. Proc. of the European Conference on Computer Vision, 2018: 552–568.
19	HUANG G, LIU S, VAN DER MAATEN L, et al. CondenseNet: an efficient DenseNet using learned group convolutions. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 2752–2761.
20	WANG R J, LI X, LING C X. Pelee: a real-time object detection system on mobile devices. Advances in Neural Information Processing Systems, 2018, 31: 1967–1976.
21	ZOPH B, LE Q V. Neural architecture search with reinforcement learning. https://arxiv.org/abs/1611.01578.
22	TAN M X, LE Q V. EfficientNet: rethinking model scaling for convolutional neural networks. Proc. of the 36th International Conference on Machine Learning, 2019: 6105–6114.
23	CAI H, ZHU L G, HAN S. ProxylessNAS: direct neural architecture search on target task and hardware. https://arxiv.org/abs/1812.00332.
24	WU B C, DAI X L, ZHANG P Z, et al. FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 10734–10742.
25	MA N N, ZHANG X Y, HUANG J W, et al. WeightNet: revisiting the design space of weight networks. Proc. of the European Conference on Computer Vision, 2020: 776–792.
26	MNIH V, HEESS N, GRAVES A, et al. Recurrent models of visual attention. Advances in Neural Information Processing Systems, 2014: 2204–2212.
27	HUANG Z L, WANG X G, WEI Y C, et al. CCNet: criss-cross attention for semantic segmentation. Proc. of the IEEE/CVF International Conference on Computer Vision, 2019: 603–612.
28	LI X, ZHONG Z S, WU J L, et al. Expectation-maximization attention networks for semantic segmentation. Proc. of the IEEE/CVF International Conference on Computer Vision, 2019: 9167–9176.
29	ROBBINS H, MONRO S. A stochastic approximation method. The Annals of Mathematical Statistics, 1951: 400–407.
30	DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009: 248–255.
31	GALLO I, RIA G, LANDRO N, et al. Image and text fusion for UPMC Food-101 using BERT and CNNs. Proc. of the 35th International Conference on Image and Vision Computing New Zealand, 2020. DOI: 10.1109/IVCNZ51579.2020.9290622.
32	CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations. Proc. of the International Conference on Machine Learning, 2020: 1597–1607.
33	GRILL J B, STRUB F, ALTCHÉ F, et al. Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 2020: 21271–21284.
34	GOYAL P, DUVAL Q, SEESSEL I, et al. Vision models are more robust and fair when pretrained on uncurated images without supervision. https://arxiv.org/abs/2202.08360.
35	CARON M, MISRA I, MAIRAL J, et al. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 2020: 9912–9924.
36	DWIBEDI D, AYTAR Y, TOMPSON J, et al. With a little help from my friends: nearest-neighbor contrastive learning of visual representations. Proc. of the IEEE/CVF International Conference on Computer Vision, 2021: 9588–9597.
37	LU Z, SREEKUMAR G, GOODMAN E, et al. Neural architecture transfer. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2021, 43(9): 2971–2989. DOI: 10.1109/TPAMI.2021.3052758.
38	TOUVRON H, SABLAYROLLES A, DOUZE M, et al. Grafit: learning fine-grained image representations with coarse labels. Proc. of the IEEE/CVF International Conference on Computer Vision, 2021: 874–884.

Model setup	Params/×10⁶	Top-1 accuracy/%	Top-5 accuracy/%
Baseline	2.3	81.1981	94.8874
+SE	3.48	82.1149	95.1327
+ECA	2.3	81.5475	95.0218
+CBAM	4.6	80.4119	94.6099
+CA	2.8	82.3525	95.2911
+BSA	2.3	81.3822	94.9267
+BSA&SE	3.48	82.4713	95.4297
+BSA&ECA	2.3	81.6515	95.1049

Model setup	Params/×10⁶	Top-1 accuracy/%	Top-5 accuracy/%
Baseline	2.55	60.2589	78.6717
+SE	3.7	60.8717	79.4920
+ECA	2.55	60.1322	78.7624
+CBAM	4.8	60.6839	79.6007
+CA	3.0	60.4566	79.0967
+BSA	2.55	60.5752	79.0769
+BSA&SE	3.7	60.3775	79.0077
+BSA&ECA	2.55	61.5932	80.1542

Model setup	Params/×10⁶	Top-1 accuracy/%	Top-5 accuracy/%
Baseline	2.3	76.2917	92.1757
+SE	3.48	77.0150	92.5500
+ECA	2.3	76.6167	92.2417
+CBAM	4.6	75.8500	92.3667
+CA	2.8	76.3833	92.4917
+BSA	2.3	77.0583	92.6917
+BSA&SE	3.48	77.1000	92.8333
+BSA&ECA	2.3	76.6500	92.2750

Model setup	Params/×10⁶	Top-1 accuracy/%	Top-5 accuracy/%
Baseline-1.0	2.30	76.2917	92.1757
+CBAM	4.60	75.8500	92.3667
+CA	2.79	76.3833	92.4917
+BSA	2.30	77.0583	92.6917
+BSA+SE	3.48	77.1000	92.8333
Baseline-0.75	1.48	75.4750	92.2917
+CBAM	2.76	75.1833	94.4333
+CA	1.74	75.6250	92.3417
+BSA	1.48	75.7000	92.0083
+BSA+SE	2.12	75.8083	92.4333
Baseline-0.5	0.82	73.5500	91.7250
+CBAM	1.39	74.1333	91.8417
+CA	0.94	73.9333	91.9833
+BSA	0.82	73.9500	91.7833
+BSA+SE	1.10	74.5000	92.0417
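
To read the table above as a trade-off, it can help to express each attention variant as a change relative to its own baseline. The short script below does only that arithmetic on the values copied from the table; the names `TABLE`, `deltas_vs_baseline`, and `summary` are illustrative and not from the paper.

```python
# (params in 10^6, top-1 accuracy in %) copied from the table above.
TABLE = {
    "Baseline-1.0": {"Baseline": (2.30, 76.2917), "+CBAM": (4.60, 75.8500),
                     "+CA": (2.79, 76.3833), "+BSA": (2.30, 77.0583),
                     "+BSA+SE": (3.48, 77.1000)},
    "Baseline-0.75": {"Baseline": (1.48, 75.4750), "+CBAM": (2.76, 75.1833),
                      "+CA": (1.74, 75.6250), "+BSA": (1.48, 75.7000),
                      "+BSA+SE": (2.12, 75.8083)},
    "Baseline-0.5": {"Baseline": (0.82, 73.5500), "+CBAM": (1.39, 74.1333),
                     "+CA": (0.94, 73.9333), "+BSA": (0.82, 73.9500),
                     "+BSA+SE": (1.10, 74.5000)},
}


def deltas_vs_baseline(group):
    """Return {variant: (extra params in 10^6, top-1 change in points)}."""
    base_params, base_top1 = group["Baseline"]
    return {name: (round(params - base_params, 2), round(top1 - base_top1, 4))
            for name, (params, top1) in group.items() if name != "Baseline"}


summary = {g: deltas_vs_baseline(rows) for g, rows in TABLE.items()}

for g, rows in summary.items():
    for name, (d_params, d_top1) in rows.items():
        print(f"{g:14s} {name:8s} params {d_params:+.2f}M  top-1 {d_top1:+.4f}")
```

At the precision reported in the table, +BSA adds no parameters in any of the three groups, while +CBAM roughly doubles the parameter count in the Baseline-1.0 group and lowers top-1 accuracy in the Baseline-1.0 and Baseline-0.75 groups.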

Model setup	Backbone	Params of the backbone/×10⁶	Top-1 accuracy/%
Inception V3 [31]	Inception V3	27.2	71.7
SimCLR [32]	ResNet-50	25.5	72.8
BYOL [33]	ResNet-50	25.5	75.3
SEER [34]	RegNet-8gf	42	76.2
SwAV [35]	ResNet-50	25.5	76.4
NNCLR [36]	ResNet-50	25.5	76.7
NAT-M2 [37]	NAT-M2	4.1	88.5
Grafit [38]	ResNet-50	25.5	89.5
EfficientNet B7 [22]	EfficientNet-B7	66	93
EfficientNet B0+BSA+SE	EfficientNet B0	5.3	83.4
