Enhancing convolution for Transformer-based weakly supervised semantic segmentation

doi:10.23919/JSEE.2025.000165

Journal of Systems Engineering and Electronics, 2026, Vol. 37, Issue (1): 84-93. doi: 10.23919/JSEE.2025.000165

• ELECTRONICS TECHNOLOGY •

Yu LIU, Diaoyin TAN, Wen ZHOU, Huaxin XIAO

College of Systems Engineering, National University of Defense Technology, Changsha 410073, China

Received: 2022-12-24 Accepted: 2025-11-08 Online: 2026-02-18 Published: 2026-03-09
Contact: Huaxin XIAO E-mail: 704985427@qq.com; zhouwen@nudt.edu.cn; xiaohuaxin@nudt.edu.cn
About author:
LIU Yu was born in 1983. He received his B.S. degree from Northwestern Polytechnical University, Xi’an, China in 2005. He then received his M.S. degree in image processing and his Ph.D. degree in computer graphics from the University of East Anglia, Norwich, UK, in 2007 and 2011, respectively. He is currently a professor in the College of Systems Engineering, National University of Defense Technology. His research interests include image/video processing, computer graphics, and visual haptic technology. E-mail: jasonyuliu@nudt.edu.cn

TAN Diaoyin was born in 1998. He received his Ph.D. degree from the National University of Defense Science and Technology in 2022. He is currently working as an assistant engineer at the Aerospace Science and Technology Corporation. His research interests are artificial intelligence and computational vision. E-mail: 704985427@qq.com

ZHOU Wen was born in 1984. He received his Ph.D. degree in management science and engineering from Harbin Engineering University in 2015. He is currently an assistant professor with the College of Systems Engineering, National University of Defense Technology, Changsha, China. His main research interests are information systems and complex data analysis. E-mail: zhouwen@nudt.edu.cn

XIAO Huaxin was born in 1989. He received his B.E. degree from the University of Electronic Science and Technology of China, China in 2012 and his Ph.D. degree from the National University of Defense Technology, China in 2018. He was a visiting student at the National University of Singapore from 2016 to 2018. He is currently a lecturer in the College of Systems Engineering at the National University of Defense Technology. He was a winner of the object localization task in ILSVRC 2017. His current research interests focus on saliency detection and image/video object segmentation. E-mail: xiaohuaxin@nudt.edu.cn

Abstract

Weakly supervised semantic segmentation (WSSS) is a challenging task in which only category information is provided for segmentation prediction. The key stage of WSSS is therefore the generation of pseudo labels. Convolutional neural network (CNN) based methods obtain pseudo labels through class activation mapping (CAM), which concentrates only on the most discriminative parts of an object. More recent Transformer-based methods predict pseudo labels from the attention maps of the multi-headed self-attention (MHSA) module, but these labels usually contain obvious background noise and incoherent object areas. To address these problems, we use the Conformer, a parallel network that combines a CNN branch and a Transformer branch, as our backbone. The two branches generate and refine pseudo labels independently, which effectively combines the advantages of CNN and Transformer. However, information exchange between the parallel branches is not tight enough, so the pseudo labels lack detail and background noise persists. To alleviate this problem, we propose the enhancing convolution CAM (ECCAM) model, which has three improved modules based on enhancing convolution: a deeper stem (DStem), a convolutional feed-forward network (CFFN), and a feature coupling unit with convolution (FCUConv). ECCAM gives the Conformer tighter interaction between its CNN and Transformer branches. Experiments verify that the proposed modules help the network perceive more local information from images, making the final segmentation results more refined. Compared with a similar architecture, our modules greatly improve semantic segmentation performance and achieve 70.2% mean intersection over union (mIoU) on the PASCAL VOC 2012 dataset.
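As a rough illustration of the CAM step mentioned in the abstract, the sketch below computes a class activation map as a weighted sum of final-layer feature maps, in the spirit of [5], and thresholds it into a binary pseudo label. It is not the ECCAM model: the function names, the toy values, and the 0.5 threshold are assumptions made purely for illustration.

```python
from typing import List

Grid = List[List[float]]  # one H x W feature map


def class_activation_map(features: List[Grid], class_weights: List[float]) -> Grid:
    """Weighted sum of K feature maps, followed by ReLU and max-normalisation."""
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(features, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    # Keep only positive evidence for the class.
    cam = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0.0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


def pseudo_label(cam: Grid, threshold: float = 0.5) -> List[List[int]]:
    """Binary mask: 1 where the normalised CAM exceeds the (assumed) threshold."""
    return [[1 if v > threshold else 0 for v in row] for row in cam]


# Toy input: two 2x2 feature maps and hypothetical classifier weights for one class.
features = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
weights = [1.0, -0.5]

cam = class_activation_map(features, weights)
mask = pseudo_label(cam)
print(cam)   # [[0.5, 0.0], [0.0, 1.0]]
print(mask)  # [[0, 0], [0, 1]]
```

In this toy case only the single strongest location survives the threshold, which mirrors the abstract's remark that CAM tends to concentrate on the most discriminative parts.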

Key words: weakly supervised semantic segmentation, transformer, convolutional neural network

Yu LIU, Diaoyin TAN, Wen ZHOU, Huaxin XIAO. Enhancing convolution for Transformer-based weakly supervised semantic segmentation[J]. Journal of Systems Engineering and Electronics, 2026, 37(1): 84-93.

Figures/Tables (5): Fig. 1, Fig. 2, Fig. 3, Table 1, Table 2

References 45

1	LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 3431–3440.
2	CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2017, 40(4): 834–848.
3	ZHAO H S, SHI J P, QI X J, et al. Pyramid scene parsing network. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2881–2890.
4	HE J J, DENG Z Y, ZHOU L, et al. Adaptive pyramid context network for semantic segmentation. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 7519–7528.
5	ZHOU B, KHOSLA A, LAPEDRIZA A, et al. Learning deep features for discriminative localization. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 2921–2929.
6	AHN J, CHO S, KWAK S. Weakly supervised learning of instance segmentation with inter-pixel relations. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 2209–2218.
7	WANG Y D, ZHANG J, KAN M, et al. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 12275–12284.
8	WEI Y C, LIANG X D, CHEN Y P, et al. STC: a simple to complex framework for weakly-supervised semantic segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2016, 39(11): 2314–2320.
9	LEE J, KIM E, LEE S, et al. FickleNet: weakly and semi-supervised semantic image segmentation using stochastic inference. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 5267–5276.
10	KIRILLOV A, GIRSHICK R, HE K, et al. Panoptic feature pyramid networks. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. DOI: 10.1109/CVPR.2019.00656.
11	ZHOU T, ZHANG M, ZHAO F, et al. Regional semantic contrast and aggregation for weakly supervised semantic segmentation. 2022. DOI: 10.48550/arXiv.2203.09653.
12	WEI Y C, FENG J S, LIANG X D, et al. Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 1568–1576.
13	HOU Q B, JIANG P T, WEI Y C, et al. Self-erasing network for integral object attention. Advances in Neural Information Processing Systems, 2018. DOI: 10.48550/arXiv.1810.09821.
14	JIANG P T, HOU Q B, CAO Y, et al. Integral object mining via online attention accumulation. Proc. of the IEEE/CVF International Conference on Computer Vision, 2019: 2070–2079.
15	ZHANG F, GU C C, ZHANG C Y, et al. Complementary patch for weakly supervised semantic segmentation. Proc. of the IEEE/CVF International Conference on Computer Vision, 2021: 7242–7251.
16	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need. https://arxiv.org/abs/1706.03762.
17	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale. https://arxiv.org/abs/2010.11929.
18	LI R, MAI Z, TRABELSI C, et al. TransCAM: transformer attention-based CAM refinement for weakly supervised semantic segmentation. Journal of Visual Communication and Image Representation, 2023, 92: 103800.
19	XU L, OUYANG W, BENNAMOUN M, et al. Multi-class token Transformer for weakly supervised semantic segmentation. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 4310–4319.
20	RU L X, ZHAN Y B, YU B S, et al. Learning affinity from attention: end-to-end weakly-supervised semantic segmentation with Transformers. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 16846–16855.
21	PENG Z L, HUANG W, GU S Z, et al. Conformer: local features coupling global representations for visual recognition. Proc. of the IEEE/CVF International Conference on Computer Vision, 2021: 367–376.
22	HOU Q B, JIANG P T, WEI Y C, et al. Self-erasing network for integral object attention. https://arxiv.org/abs/1810.09821.
23	ZHANG X L, WEI Y C, FENG J S, et al. Adversarial complementary learning for weakly supervised object localization. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 1325–1334.
24	KIM B, HAN S, KIM J. Discriminative region suppression for weakly-supervised semantic segmentation. Proc. of the AAAI Conference on Artificial Intelligence, 2021: 1754–1761.
25	GAO W, WAN F, PAN X J, et al. TS-CAM: token semantic coupled attention map for weakly supervised object localization. Proc. of the IEEE/CVF International Conference on Computer Vision, 2021: 2886–2895.
26	CHEN Z W, WANG C G, WANG Y B, et al. LCTR: on awakening the local continuity of transformer for weakly supervised object localization. Proc. of the AAAI Conference on Artificial Intelligence, 2022, 36(1): 410–418. DOI: 10.1609/aaai.v36i1.19918.
27	SRINIVAS A, LIN T Y, PARMAR N, et al. Bottleneck transformers for visual recognition. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 16519–16529.
28	ZHAO Y C, WANG G T, TANG C X, et al. A battle of network structures: an empirical study of CNN, transformer, and MLP. https://arxiv.org/abs/2108.13002.
29	GUO J Y, HAN K, WU H, et al. CMT: convolutional neural networks meet vision transformers. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 12175–12185.
30	LI Y H, YAO T, PAN Y W, et al. Contextual transformer networks for visual recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2023, 45(2): 1489–1500.
31	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770–778.
32	HENDRYCKS D, GIMPEL K. Gaussian error linear units (GELUs). https://arxiv.org/abs/1606.08415.
33	SIFRE L, MALLAT S. Rigid-motion scattering for texture classification. https://arxiv.org/abs/1403.1687.
34	EVERINGHAM M, VAN GOOL L, WILLIAMS C K I, et al. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 2010, 88(2): 303–338. DOI: 10.1007/s11263-009-0275-4.
35	HARIHARAN B, ARBELAEZ P, BOURDEV L, et al. Semantic contours from inverse detectors. Proc. of the International Conference on Computer Vision, 2011: 991–998.
36	RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211–252. DOI: 10.1007/s11263-015-0816-y.
37	LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization. https://arxiv.org/abs/1711.05101.
38	AHN J, KWAK S. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 4981–4990.
39	KRAHENBUHL P, KOLTUN V. Efficient inference in fully connected CRFs with Gaussian edge potentials. Proc. of the 25th International Conference on Neural Information Processing Systems, 2011: 109–117.
40	OH S J, BENENSON R, KHOREVA A, et al. Exploiting saliency for object segmentation from image level labels. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 5038–5047.
41	ZHANG D, ZHANG H W, TANG J H, et al. Causal intervention for weakly-supervised semantic segmentation. Advances in Neural Information Processing Systems, 2020, 33: 655–666.
42	FAN J S, ZHANG Z X, SONG C F, et al. Learning integral objects with intra-class discriminator for weakly-supervised semantic segmentation. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 4283–4292.
43	LEE J, KIM E, YOON S. Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 4071–4080.
44	XU L, OUYANG W, BENNAMOUN M, et al. Leveraging auxiliary tasks with affinity learning for weakly supervised semantic segmentation. Proc. of the IEEE/CVF International Conference on Computer Vision, 2021: 6984–6993.
45	LI Y, DUAN Y Q, KUANG Z H, et al. Uncertainty estimation via response scaling for pseudo-mask noise mitigation in weakly-supervised semantic segmentation. Proc. of the AAAI Conference on Artificial Intelligence, 2022: 1447–1455. DOI: 10.1609/aaai.v36i2.20034.

Table 1
Method	Backbone	Supervision	Validation mIoU/%
IRN_CVPR19[6]	ResNet50	I	63.5
SEAM_CVPR20[7]	ResNet38	I	64.5
CONTA_NIPS20[41]	ResNet38	I	66.1
ICD_CVPR20[42]	ResNet101	I+S	67.8
AdvCAM_CVPR21[43]	ResNet101	I	68.1
CPN_CVPR21[15]	ResNet38	I	67.8
AuxSegNet_ICCV21[44]	ResNet38	I+S	69.0
URN_AAAI22[45]	ResNet38	I	69.4
TransCAM_*	ResNet38	I	68.3
ECCAM	ResNet38	I	70.2 (+1.9)
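
The scores in the table above are mean intersection over union (mIoU) values. As a hedged reminder of how this metric is defined, the sketch below computes mIoU on a tiny flattened label map. The function name and toy labels are illustrative assumptions; a benchmark evaluation such as PASCAL VOC 2012 accumulates intersections and unions over the entire validation set and over all 21 classes (including background) rather than over a single image.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over the classes that occur in the prediction or the ground truth."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel counts once towards both intersection and union of its class.
            intersection[p] += 1
            union[p] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels (flattened label maps).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
print(round(miou, 3))  # 0.5  (per-class IoU: 1/3, 2/3, 1/2)
```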

Table 2
Baseline	DStem	CFFN	FCUConv	mIoU/%
√	−	−	−	68.3
√	√	−	−	68.5(+0.2)
√	−	√	−	68.8(+0.5)
√	−	−	√	68.8(+0.5)
√	√	−	√	69.4(+1.1)
√	√	√	−	69.9(+1.6)
√	√	√	√	70.2(+1.9)
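
The gains in parentheses can be re-derived from the 68.3% baseline row. The short check below does that arithmetic using the values copied from the ablation table; the dictionary layout and variable names are assumptions made for illustration only.

```python
# mIoU (%) values copied from the ablation table; keys list the enabled modules.
baseline = 68.3
ablation = {
    ("DStem",): 68.5,
    ("CFFN",): 68.8,
    ("FCUConv",): 68.8,
    ("DStem", "FCUConv"): 69.4,
    ("DStem", "CFFN"): 69.9,
    ("DStem", "CFFN", "FCUConv"): 70.2,
}

# Gain of each configuration over the baseline, rounded to one decimal place.
gains = {modules: round(miou - baseline, 1) for modules, miou in ablation.items()}

# Sum of the three single-module gains versus the gain with all modules enabled.
single_sum = round(sum(g for m, g in gains.items() if len(m) == 1), 1)
full_gain = gains[("DStem", "CFFN", "FCUConv")]

for modules, gain in gains.items():
    print("+".join(modules), f"+{gain}")
print("sum of single-module gains:", single_sum)
print("gain with all three modules:", full_gain)
```

In these numbers the full configuration gains 1.9 points, more than the 1.2 points obtained by adding up the three single-module gains, which is consistent with the modules complementing one another.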

