Journal of Systems Engineering and Electronics ›› 2026, Vol. 37 ›› Issue (1): 84-93. doi: 10.23919/JSEE.2025.000165

• ELECTRONICS TECHNOLOGY •

Enhancing convolution for Transformer-based weakly supervised semantic segmentation

Yu LIU, Diaoyin TAN, Wen ZHOU, Huaxin XIAO

  • Received: 2022-12-24 Accepted: 2025-11-08 Online: 2026-02-18 Published: 2026-03-09
  • Contact: Huaxin XIAO E-mail: 704985427@qq.com; zhouwen@nudt.edu.cn; xiaohuaxin@nudt.edu.cn
  • About author:
    LIU Yu was born in 1983. He received his B.S. degree from Northwestern Polytechnical University, Xi’an, China, in 2005. He then received his M.S. degree in image processing and his Ph.D. degree in computer graphics from the University of East Anglia, Norwich, UK, in 2007 and 2011, respectively. He is currently a professor in the College of Systems Engineering, National University of Defense Technology. His research interests include image/video processing, computer graphics, and visual haptic technology. E-mail: jasonyuliu@nudt.edu.cn

    TAN Diaoyin was born in 1998. He received his Ph.D. degree from the National University of Defense Science and Technology in 2022. He is currently working as an assistant engineer at the Aerospace Science and Technology Corporation. His research interests are artificial intelligence and computational vision. E-mail: 704985427@qq.com

    ZHOU Wen was born in 1984. He received his Ph.D. degree in management science and engineering from Harbin Engineering University in 2015. He is currently an assistant professor with the College of Systems Engineering, National University of Defense Technology, Changsha, China. His main research interests are information systems and complex data analysis. E-mail: zhouwen@nudt.edu.cn

    XIAO Huaxin was born in 1989. He received his B.E. degree from the University of Electronic Science and Technology of China, China, in 2012 and his Ph.D. degree from the National University of Defense Technology, China, in 2018. He was a visiting student at the National University of Singapore from 2016 to 2018. He is currently a lecturer in the College of Systems Engineering at the National University of Defense Technology. He was a winner of the object localization task in ILSVRC 2017. His current research interests focus on saliency detection and image/video object segmentation. E-mail: xiaohuaxin@nudt.edu.cn

Abstract:

Weakly supervised semantic segmentation (WSSS) is a challenging task in which only category information is provided as supervision for segmentation prediction. The key stage of WSSS is therefore the generation of pseudo labels. Convolutional neural network (CNN) based methods obtain pseudo labels with class activation mapping (CAM), which concentrates only on the most discriminative parts of an object. More recently, Transformer-based methods have used attention maps from the multi-head self-attention (MHSA) module to predict pseudo labels, but these maps usually contain obvious background noise and incoherent object regions. To address these problems, we use the Conformer as our backbone, a parallel network that combines a CNN branch and a Transformer branch. The two branches generate and refine pseudo labels independently, which effectively combines the advantages of CNN and Transformer. However, information exchange between the parallel branches is not tight enough; as a result, the pseudo labels lack detail and background noise remains. To alleviate this problem, we propose the enhancing convolution CAM (ECCAM) model, which has three improved modules based on enhancing convolution: a deeper stem (DStem), a convolutional feed-forward network (CFFN), and a feature coupling unit with convolution (FCUConv). ECCAM gives the Conformer tighter interaction between its CNN and Transformer branches. Experiments verify that the proposed modules help the network perceive more local information from images, making the final segmentation results more refined. Compared with similar architectures, our modules greatly improve semantic segmentation performance and achieve 70.2% mean intersection over union (mIoU) on the PASCAL VOC 2012 dataset.
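
For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean intersection over union can be computed from predicted and ground-truth label maps. It is an illustration only, not the authors' evaluation code: the function name `mean_iou`, the flattened-list input format, the choice to skip classes absent from both maps, and the `ignore_index=255` convention (commonly used for unlabeled pixels in PASCAL VOC annotations) are assumptions made here.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, target: flat sequences of integer class labels of equal length.
    Pixels whose ground-truth label equals ignore_index are skipped
    (an assumed convention, not taken from the paper).
    """
    # conf[t][p] counts pixels with ground-truth class t predicted as class p.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]                                    # true positives
        fn = sum(conf[c]) - tp                             # missed pixels of class c
        fp = sum(conf[r][c] for r in range(num_classes)) - tp  # wrongly labeled as c
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both maps (illustrative choice)
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: a flattened 2x3 label map with 3 classes (0 = background).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoU values are 1/2, 2/3, and 1, so `score` is their average, 13/18. Benchmark evaluations typically accumulate a single confusion matrix over the whole validation set before averaging, rather than scoring each image separately.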

Key words: weakly supervised semantic segmentation, transformer, convolutional neural network