Multi-scale object detection by top-down and bottom-up feature pyramid network

doi:10.21629/JSEE.2019.01.01

Journal of Systems Engineering and Electronics, 2019, Vol. 30, Issue 1: 1-12.

Electronics Technology

Baojun ZHAO¹,², Boya ZHAO¹,², Linbo TANG¹,²,*, Wenzheng WANG¹,², Chen WU¹,²

¹ School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
² Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China

Received: 2018-05-08 Online: 2019-02-27 Published: 2019-02-26
Contact: Linbo TANG
E-mail: zbj@bit.edu.cn; zhaoboya@bit.edu.cn; tanglinbo@bit.edu.cn; wwz@bit.edu.cn; wuchen@gmail.com
About the authors:
ZHAO Baojun was born in 1960. He received his Ph.D. degree in electromagnetic measurement technology and equipment from Harbin Institute of Technology (HIT), Harbin, China, in 1996. From 1996 to 1998, he was a postdoctoral fellow at Beijing Institute of Technology (BIT), Beijing, China. Since 1998, he has been engaged in teaching and research work at Radar Research Laboratory, BIT. His main research interests include image/video coding, image recognition, infrared/laser signal processing, and parallel signal processing. E-mail: zbj@bit.edu.cn
ZHAO Boya was born in 1990. He received his B.Sc. degree from the School of Electrical Engineering and Information, Hebei University of Technology, Tianjin, China, in 2013. He is currently pursuing his Ph.D. degree with the School of Electrical and Information Engineering, Beijing Institute of Technology, Beijing, China. His current research interests include object detection, object tracking and machine learning. E-mail: zhaoboya@bit.edu.cn
TANG Linbo was born in 1978. He received his B.Sc. degree in resources exploration engineering from Changchun University of Science and Technology, Changchun, China, and his M.Sc. degree in radio physics from China University of Petroleum, Beijing, China. He received his Ph.D. degree from the School of Electrical Engineering and Information, Hebei University of Technology, Tianjin, China, in 2005. Since 2005, he has been engaged in teaching and research work at Radar Research Laboratory, Beijing Institute of Technology. He has undertaken 863 and H863 projects. His research interests include image processing and real-time signal processing. E-mail: tanglinbo@bit.edu.cn
WANG Wenzheng was born in 1988. He received his M.Sc. degree from the School of Electrical and Information Engineering, Beijing Institute of Technology, Beijing, China, in 2014. He is currently pursuing his Ph.D. degree with the School of Electrical and Information Engineering, Beijing Institute of Technology, Beijing, China. His current research interests include hyperspectral/optical imagery target detection, feature selection and machine learning. E-mail: wwz@bit.edu.cn
WU Chen was born in 1994. He received his B.Sc. degree from the School of Electrical Engineering and Information, Xidian University, Xi'an, China, in 2017. He is currently pursuing his M.Sc. degree with the School of Electrical and Information Engineering, Beijing Institute of Technology, Beijing, China. His current research interests include object detection and machine learning. E-mail: wuchen@gmail.com
Supported by: the Program of Introducing Talents of Discipline to Universities (111 Plan) of China (B14010) and the National Natural Science Foundation of China (31727901).

Abstract

With advances in object detection technology, especially deep neural networks, many related tasks, such as medical applications and industrial automation, have achieved great success. However, detecting objects with multiple aspect ratios and scales remains a key problem. This paper proposes a top-down and bottom-up feature pyramid network (TDBU-FPN), which combines multi-scale feature representation with anchor generation at multiple aspect ratios. First, to build the multi-scale feature maps, several fully convolutional layers are appended to the backbone. Second, to link neighboring feature maps, top-down and bottom-up flows are adopted: the top-down flow, implemented by deconvolution, introduces context information, and the bottom-up flow, implemented by pooling, supplements suboriginal information. Third, the problem of adapting to different object aspect ratios is tackled by placing anchors with multiple aspect ratios on each multi-scale feature map. The proposed method is evaluated on the pattern analysis, statistical modeling and computational learning visual object classes (PASCAL VOC) dataset and reaches a mean average precision (mAP) of 79.0%, a 1.8% improvement over SSD 300 (77.2%), at a detection speed of 23 fps.

Key words: convolutional neural network (CNN), feature pyramid network (FPN), object detection, deconvolution
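
The two flows described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: single-channel feature maps are represented as nested lists, neighboring maps are assumed to differ by a factor of 2 in resolution, nearest-neighbour repetition stands in for the learned deconvolution, and stacking the three maps as channels is an assumed fusion rule. All function names are hypothetical.

```python
def max_pool_2x2(fmap):
    """Bottom-up flow: halve the resolution with 2x2 max pooling (stride 2)."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]


def upsample_2x(fmap):
    """Top-down flow: double the resolution.

    Nearest-neighbour repetition is a stand-in (assumption) for the
    learned deconvolution named in the abstract.
    """
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(finer, current, coarser):
    """Stack the pooled finer map, the current map and the upsampled
    coarser map as channels (assumed fusion rule)."""
    return [max_pool_2x2(finer), current, upsample_2x(coarser)]


finer = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
current = [[0.5, 0.5], [0.5, 0.5]]
coarser = [[7]]

fused = fuse(finer, current, coarser)
for channel in fused:
    print(channel)
```

Each fused channel has the resolution of the current map, so a detection head applied at that level sees finer detail from below and wider context from above.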

Baojun ZHAO, Boya ZHAO, Linbo TANG, Wenzheng WANG, Chen WU. Multi-scale object detection by top-down and bottom-up feature pyramid network[J]. Journal of Systems Engineering and Electronics, 2019, 30(1): 1-12.

Feature	Anchor height	Anchor width	Number of anchors
Conv3_3	0.035 4, 0.028 9, 0.070 7	0.070 7, 0.086 6, 0.035 4	33 750
Conv3_3	0.086 6, 0.050 0, 0.070 7	0.028 9, 0.050 0, 0.070 7	33 750
Conv4_3	0.070 7, 0.057 7, 0.141 4	0.141 4, 0.173 2, 0.070 7	8 864
Conv4_3	0.173 2, 0.100 0, 0.141 4	0.057 7, 0.100 0, 0.141 4	8 864
Conv6_2	0.141 4, 0.115 5, 0.282 8	0.282 8, 0.346 4, 0.141 4	2 166
Conv6_2	0.346 4, 0.200 0, 0.278 4	0.115 5, 0.200 0, 0.278 4	2 166
Conv7_2	0.274 0, 0.223 7, 0.548 0	0.548 0, 0.671 2, 0.274 0	600
Conv7_2	0.671 2, 0.387 5, 0.472 0	0.223 7, 0.387 5, 0.472 0	600
Conv8_2	0.406 6, 0.332 0, 0.813 2	0.813 2, 0.995 9, 0.406 6	150
Conv8_2	0.995 9, 0.575 0, 0.662 1	0.332 0, 0.575 0, 0.662 1	150
Conv9_2	0.539 2, 0.440 2, 1.078 3	1.078 3, 1.320 7, 0.539 2	54
Conv9_2	1.320 7, 0.762 5, 0.851 1	0.440 2, 0.762 5, 0.851 1	54
Conv10_2	0.671 8, 0.548 5, 1.343 5	1.343 5, 1.645 4, 0.671 8	6
Conv10_2	1.645 4, 0.950 0, 1.039 5	0.548 5, 0.950 0, 1.039 5	6
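
The anchor shapes in the table above are consistent with a scale-and-ratio rule of the kind used for SSD default boxes: for scale s and aspect ratio a (width over height), the height is s/√a and the width is s·√a, plus one square anchor of side s and one of side √(s·s_next). The sketch below assumes that rule; the function name, the ratio set and the scales 0.05 and 0.1 are inferred from the Conv3_3 and Conv4_3 rows rather than stated in the text.

```python
import math


def anchor_shapes(scale, next_scale, ratios=(2.0, 3.0, 0.5, 1.0 / 3.0)):
    """Return six (height, width) pairs as fractions of the image side.

    Assumed rule: height = scale / sqrt(r), width = scale * sqrt(r),
    where r is the width-to-height ratio.
    """
    shapes = [(scale / math.sqrt(r), scale * math.sqrt(r)) for r in ratios]
    shapes.append((scale, scale))                # square anchor at this scale
    extra = math.sqrt(scale * next_scale)        # square anchor between scales
    shapes.append((extra, extra))
    return shapes


# Assumed scales for Conv3_3 (0.05) and the next feature map (0.1).
conv3_3 = anchor_shapes(0.05, 0.1)
for height, width in conv3_3:
    print(f"{height:.4f} x {width:.4f}")
```

With these inputs the first two anchors come out as 0.0354 × 0.0707 and 0.0289 × 0.0866, matching the first Conv3_3 row.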

Feature	Confidence kernel	Location kernel
Conv3_3	$3 \times 3 \times 1\,024 \times 6 \times (n+1)$	$3 \times 3 \times 1\,024 \times 4 \times (n+1)$
Conv4_3	$3 \times 3 \times 2\,048 \times 6 \times (n+1)$	$3 \times 3 \times 2\,048 \times 4 \times (n+1)$
Conv6_2	$3 \times 3 \times 2\,048 \times 6 \times (n+1)$	$3 \times 3 \times 2\,048 \times 4 \times (n+1)$
Conv7_2	$3 \times 3 \times 1\,792 \times 6 \times (n+1)$	$3 \times 3 \times 1\,792 \times 4 \times (n+1)$
Conv8_2	$3 \times 3 \times 1\,024 \times 6 \times (n+1)$	$3 \times 3 \times 1\,024 \times 4 \times (n+1)$
Conv9_2	$3 \times 3 \times 768 \times 6 \times (n+1)$	$3 \times 3 \times 768 \times 4 \times (n+1)$
Conv10_2	$3 \times 3 \times 512 \times 6 \times (n+1)$	$3 \times 3 \times 512 \times 4 \times (n+1)$
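
To give a sense of size, the kernel dimensions in the table above can be multiplied out. The sketch assumes that n is the number of object classes (20 for PASCAL VOC, so n + 1 = 21 with background) and simply evaluates the products as written in the table; the function name and the dictionary are illustrative.

```python
def kernel_weights(in_channels, outputs_per_location, n_classes, kernel_size=3):
    """Evaluate k * k * C * outputs * (n + 1), the product given in the table."""
    return kernel_size * kernel_size * in_channels * outputs_per_location * (n_classes + 1)


# Input channels per feature map, copied from the table.
channels = {
    "Conv3_3": 1024, "Conv4_3": 2048, "Conv6_2": 2048, "Conv7_2": 1792,
    "Conv8_2": 1024, "Conv9_2": 768, "Conv10_2": 512,
}
n = 20  # assumption: n is the number of PASCAL VOC object classes

sizes = {
    name: (kernel_weights(c, 6, n), kernel_weights(c, 4, n))
    for name, c in channels.items()
}
for name, (confidence, location) in sizes.items():
    print(f"{name}: confidence {confidence:,} weights, location {location:,} weights")
```

Under this assumption the smallest confidence kernel (Conv10_2) has 580,608 weights and the largest (Conv4_3 and Conv6_2) has 2,322,432.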

Class	Faster R-CNN	ION	RFCN	SSD 300	MR-CNN	TDBU-FPN
Aeroplane	76.5	79.2	79.0	81.0	80.3	82.6
Bicycle	79.0	83.1	80.3	84.2	84.1	84.5
Bird	70.9	77.6	76.6	76.7	78.5	78.6
Boat	65.5	65.6	67.0	72.1	70.8	75.9
Bottle	52.1	54.9	63.7	51.7	68.5	61.5
Bus	83.1	85.4	84.8	86.1	88.0	85.5
Car	84.7	85.1	85.6	86.1	85.9	86.9
Cat	86.4	87.0	89.1	85.0	87.8	85.5
Chair	52.0	54.4	62.2	63.0	60.3	64.1
Cow	81.9	80.6	85.3	82.0	85.2	81.6
Dining_table	65.7	73.8	67.9	76.9	73.7	78.1
Dog	84.8	85.3	87.3	85.5	87.2	86.7
Horse	84.6	82.2	86.6	87.3	86.5	88.5
Motorbike	77.5	82.2	82.8	84.8	85.0	85.2
Person	76.7	74.4	79.0	78.8	76.4	50.1
Potted_plant	38.8	47.1	51.0	50.4	48.5	58.1
Sheep	73.6	75.8	77.6	77.2	76.3	78.8
Sofa	73.9	72.7	75.2	80.2	75.5	78.7
Tvmonitor	83.0	84.2	83.5	87.6	85.0	88.5
Train	72.6	80.4	76.5	76.2	81.0	77.8
mAP	73.2	75.6	77.0	77.2	78.2	79.0

Setting	Number of anchors	Detection speed/fps	mAP/%
Anchor-6	45 590	17	79.1
Anchor-4	30 256	24	78.1
Anchor-4 and Anchor-6	31 252	23	79.0
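
As a quick consistency check, the per-feature anchor counts listed in the anchor-shape table add up to the Anchor-6 total in this table. The sketch assumes that each feature's count covers all six of its anchor shapes and is therefore counted once per feature; the dictionary name is illustrative.

```python
# Anchor counts per feature map, copied from the anchor-shape table.
anchors_per_feature = {
    "Conv3_3": 33750,
    "Conv4_3": 8864,
    "Conv6_2": 2166,
    "Conv7_2": 600,
    "Conv8_2": 150,
    "Conv9_2": 54,
    "Conv10_2": 6,
}

total = sum(anchors_per_feature.values())
print(total)  # 45590, the Anchor-6 entry
```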

Setting	Number of anchors	Detection speed/fps	mAP/%
Feature-7	31 252	23	79.0
Feature-6	31 262	23	78.8
Feature-5	31 192	23	78.4
Feature-4	31 042	24	76.4
