Multi-scale object detection by top-down and bottom-up feature pyramid network

Journal of Systems Engineering and Electronics, 2019, Vol. 30, Issue (1): 1-12. doi: 10.21629/JSEE.2019.01.01

Baojun ZHAO¹,², Boya ZHAO¹,², Linbo TANG¹,²,*, Wenzheng WANG¹,², Chen WU¹,²

¹ School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
² Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China

Received: 2018-05-08  Online: 2019-02-27  Published: 2019-02-26
Contact: Linbo TANG. E-mail: zbj@bit.edu.cn; zhaoboya@bit.edu.cn; tanglinbo@bit.edu.cn; wwz@bit.edu.cn; wuchen@gmail.com
About the authors:

ZHAO Baojun was born in 1960. He received his Ph.D. degree in electromagnetic measurement technology and equipment from Harbin Institute of Technology (HIT), Harbin, China, in 1996. From 1996 to 1998, he was a postdoctoral fellow at Beijing Institute of Technology (BIT), Beijing, China. Since 1998, he has been engaged in teaching and research work at Radar Research Laboratory, BIT. His main research interests include image/video coding, image recognition, infrared/laser signal processing, and parallel signal processing. E-mail: zbj@bit.edu.cn

ZHAO Boya was born in 1990. He received his B.Sc. degree from the School of Electrical Engineering and Information, Hebei University of Technology, Tianjin, China, in 2013. He is currently pursuing his Ph.D. degree with the School of Electrical and Information Engineering, Beijing Institute of Technology, Beijing, China. His current research interests include object detection, object tracking and machine learning. E-mail: zhaoboya@bit.edu.cn

TANG Linbo was born in 1978. He received his B.Sc. degree in resources exploration engineering from Changchun University of Science and Technology, Changchun, China. He then received his M.Sc. degree in radio physics from China University of Petroleum, Beijing, China. Finally, he received his Ph.D. degree from the School of Electrical Engineering and Information, Hebei University of Technology, Tianjin, China, in 2005. Since 2005, he has been engaged in teaching and research work at Radar Research Laboratory, Beijing Institute of Technology. He has undertaken 863 and H863 projects. His research interests include image processing and real-time signal processing. E-mail: tanglinbo@bit.edu.cn

WANG Wenzheng was born in 1988. He received his M.Sc. degree from the School of Electrical and Information Engineering, Beijing Institute of Technology, Beijing, China, in 2014. He is currently pursuing his Ph.D. degree with the School of Electrical and Information Engineering, Beijing Institute of Technology, Beijing, China. His current research interests include hyperspectral/optical imagery target detection, feature selection and machine learning. E-mail: wwz@bit.edu.cn

WU Chen was born in 1994. He received his B.Sc. degree from the School of Electrical Engineering and Information, Xidian University, Xi'an, China, in 2017. He is currently pursuing his M.Sc. degree with the School of Electrical and Information Engineering, Beijing Institute of Technology, Beijing, China. His current research interests include object detection and machine learning. E-mail: wuchen@gmail.com
Supported by: the Program of Introducing Talents of Discipline to Universities (111 Plan) of China (B14010) and the National Natural Science Foundation of China (31727901).

Abstract:

With advances in object detection technology, especially deep neural networks, many related applications, such as medical applications and industrial automation, have achieved great success. However, detecting objects with multiple aspect ratios and scales remains a key problem. This paper proposes a top-down and bottom-up feature pyramid network (TDBU-FPN), which combines multi-scale feature representation with anchor generation at multiple aspect ratios. First, to build the multi-scale feature maps, a number of fully convolutional layers are appended to the backbone. Second, to link neighboring feature maps, top-down and bottom-up flows are adopted: the top-down flow, implemented by deconvolution, introduces context information, and the bottom-up flow, implemented by pooling, supplements suboriginal information. Third, to adapt to different object aspect ratios, anchors with multiple aspect ratios are placed on each multi-scale feature map. The proposed method is evaluated on the pattern analysis, statistical modeling and computational learning visual object classes (PASCAL VOC) dataset and reaches a mean average precision (mAP) of 79%, a 1.8% improvement, at a detection speed of 23 fps.

Key words: convolutional neural network (CNN), feature pyramid network (FPN), object detection, deconvolution
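
The linking of neighboring feature maps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: nearest-neighbour upsampling stands in for the learned deconvolution, 2x2 max pooling stands in for the pooling step, and channel-wise concatenation is an assumed fusion rule. The function names `upsample2x`, `maxpool2x` and `fuse`, and all tensor sizes, are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map.
    Assumption: a fixed stand-in for the learned deconvolution (top-down flow)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def maxpool2x(x):
    """2x2 max pooling with stride 2 on a (C, H, W) map with even H and W
    (bottom-up flow)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def fuse(shallower, current, deeper):
    """Bring both neighbours to the resolution of `current` and concatenate
    along the channel axis (assumed fusion rule)."""
    return np.concatenate(
        [maxpool2x(shallower), current, upsample2x(deeper)], axis=0
    )

rng = np.random.default_rng(0)
shallower = rng.standard_normal((8, 16, 16))   # finer map, fewer channels
current = rng.standard_normal((16, 8, 8))      # map being enriched
deeper = rng.standard_normal((16, 4, 4))       # coarser map with more context
fused = fuse(shallower, current, deeper)
print(fused.shape)  # (40, 8, 8)
```

With concatenation, the fused map carries the channels of all three inputs, so the detection kernels applied to it must accept the summed channel count.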

Baojun ZHAO, Boya ZHAO, Linbo TANG, Wenzheng WANG, Chen WU. Multi-scale object detection by top-down and bottom-up feature pyramid network[J]. Journal of Systems Engineering and Electronics, 2019, 30(1): 1-12.

Figures/Tables: 20

Feature	Anchor height	Anchor width	Number of anchors
Conv3_3	0.035 4, 0.028 9, 0.070 7	0.070 7, 0.086 6, 0.035 4	33 750
Conv3_3	0.086 6, 0.050 0, 0.070 7	0.028 9, 0.050 0, 0.070 7	33 750
Conv4_3	0.070 7, 0.057 7, 0.141 4	0.141 4, 0.173 2, 0.070 7	8 864
Conv4_3	0.173 2, 0.100 0, 0.141 4	0.057 7, 0.100 0, 0.141 4	8 864
Conv6_2	0.141 4, 0.115 5, 0.282 8	0.282 8, 0.346 4, 0.141 4	2 166
Conv6_2	0.346 4, 0.200 0, 0.278 4	0.115 5, 0.200 0, 0.278 4	2 166
Conv7_2	0.274 0, 0.223 7, 0.548 0	0.548 0, 0.671 2, 0.274 0	600
Conv7_2	0.671 2, 0.387 5, 0.472 0	0.223 7, 0.387 5, 0.472 0	600
Conv8_2	0.406 6, 0.332 0, 0.813 2	0.813 2, 0.995 9, 0.406 6	150
Conv8_2	0.995 9, 0.575 0, 0.662 1	0.332 0, 0.575 0, 0.662 1	150
Conv9_2	0.539 2, 0.440 2, 1.078 3	1.078 3, 1.320 7, 0.539 2	54
Conv9_2	1.320 7, 0.762 5, 0.851 1	0.440 2, 0.762 5, 0.851 1	54
Conv10_2	0.671 8, 0.548 5, 1.343 5	1.343 5, 1.645 4, 0.671 8	6
Conv10_2	1.645 4, 0.950 0, 1.039 5	0.548 5, 0.950 0, 1.039 5	6
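
The heights and widths in the table above appear to follow the SSD-style default-box rule: for a scale s and aspect ratio r, width = s·√r and height = s/√r, plus one extra square anchor of side √(s·s_next). The sketch below is an illustration under that assumption, not the authors' code. The scales 0.1 and 0.2 are read off the square anchors of Conv4_3 and Conv6_2 in the table; the ratio set and the function name `anchor_shapes` are assumptions.

```python
import math

# Assumed aspect ratios (width / height), ordered to match the table columns.
RATIOS = (2.0, 3.0, 1 / 2, 1 / 3, 1.0)

def anchor_shapes(scale, next_scale):
    """Return six (height, width) pairs for one feature map, rounded to 4 decimals."""
    shapes = [(scale / math.sqrt(r), scale * math.sqrt(r)) for r in RATIOS]
    # Extra square anchor at the geometric mean of this scale and the next one.
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return [(round(h, 4), round(w, 4)) for h, w in shapes]

# Conv4_3 in the table: square anchor 0.1; Conv6_2: square anchor 0.2.
for height, width in anchor_shapes(0.1, 0.2):
    print(f"height={height:.4f}  width={width:.4f}")
```

For the pair (0.1, 0.2) the sketch gives heights 0.0707, 0.0577, 0.1414, 0.1732, 0.1000, 0.1414, which agree with the two Conv4_3 rows.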

Feature	Confidence kernel	Location kernel
Conv3_3	$3 \times 3 \times 1\,024 \times 6 \times (n+1)$	$3 \times 3 \times 1\,024 \times 4 \times (n+1)$
Conv4_3	$3 \times 3 \times 2\,048 \times 6 \times (n+1)$	$3 \times 3 \times 2\,048 \times 4 \times (n+1)$
Conv6_2	$3 \times 3 \times 2\,048 \times 6 \times (n+1)$	$3 \times 3 \times 2\,048 \times 4 \times (n+1)$
Conv7_2	$3 \times 3 \times 1\,792 \times 6 \times (n+1)$	$3 \times 3 \times 1\,792 \times 4 \times (n+1)$
Conv8_2	$3 \times 3 \times 1\,024 \times 6 \times (n+1)$	$3 \times 3 \times 1\,024 \times 4 \times (n+1)$
Conv9_2	$3 \times 3 \times 768 \times 6 \times (n+1)$	$3 \times 3 \times 768 \times 4 \times (n+1)$
Conv10_2	$3 \times 3 \times 512 \times 6 \times (n+1)$	$3 \times 3 \times 512 \times 4 \times (n+1)$
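
To make the kernel sizes above concrete, the following sketch multiplies out the tabulated factors for n = 20 object classes, the number of classes in PASCAL VOC. It only evaluates the expressions as printed in the table; the dictionary `IN_CHANNELS` and the function name `kernel_weights` are hypothetical helpers, and bias terms are ignored.

```python
# Input channel count per detection feature map, as listed in the table.
IN_CHANNELS = {
    "Conv3_3": 1024, "Conv4_3": 2048, "Conv6_2": 2048, "Conv7_2": 1792,
    "Conv8_2": 1024, "Conv9_2": 768, "Conv10_2": 512,
}

def kernel_weights(in_channels, factor, num_classes, kernel=3):
    """Evaluate kernel x kernel x in_channels x factor x (num_classes + 1)."""
    return kernel * kernel * in_channels * factor * (num_classes + 1)

N_CLASSES = 20  # PASCAL VOC object classes
for name, channels in IN_CHANNELS.items():
    confidence = kernel_weights(channels, 6, N_CLASSES)  # factor 6 from the table
    location = kernel_weights(channels, 4, N_CLASSES)    # factor 4 from the table
    print(f"{name}: confidence={confidence}, location={location}")
```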

Class	Faster R-CNN	ION	RFCN	SSD 300	MR-CNN	TDBU-FPN
Aeroplane	76.5	79.2	79.0	81.0	80.3	82.6
Bicycle	79.0	83.1	80.3	84.2	84.1	84.5
Bird	70.9	77.6	76.6	76.7	78.5	78.6
Boat	65.5	65.6	67.0	72.1	70.8	75.9
Bottle	52.1	54.9	63.7	51.7	68.5	61.5
Bus	83.1	85.4	84.8	86.1	88.0	85.5
Car	84.7	85.1	85.6	86.1	85.9	86.9
Cat	86.4	87.0	89.1	85.0	87.8	85.5
Chair	52.0	54.4	62.2	63.0	60.3	64.1
Cow	81.9	80.6	85.3	82.0	85.2	81.6
Dining_table	65.7	73.8	67.9	76.9	73.7	78.1
Dog	84.8	85.3	87.3	85.5	87.2	86.7
Horse	84.6	82.2	86.6	87.3	86.5	88.5
Motorbike	77.5	82.2	82.8	84.8	85.0	85.2
Person	76.7	74.4	79.0	78.8	76.4	50.1
Potted_plant	38.8	47.1	51.0	50.4	48.5	58.1
Sheep	73.6	75.8	77.6	77.2	76.3	78.8
Sofa	73.9	72.7	75.2	80.2	75.5	78.7
Tvmonitor	83.0	84.2	83.5	87.6	85.0	88.5
Train	72.6	80.4	76.5	76.2	81.0	77.8
mAP	73.2	75.6	77.0	77.2	78.2	79.0
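
The mAP row is the arithmetic mean of the per-class average precision values. As a small check, the sketch below averages the Faster R-CNN column copied from the table above; the variable names are illustrative only.

```python
# Per-class AP (%) for Faster R-CNN, copied from the table in row order.
faster_rcnn_ap = [
    76.5, 79.0, 70.9, 65.5, 52.1, 83.1, 84.7, 86.4, 52.0, 81.9,
    65.7, 84.8, 84.6, 77.5, 76.7, 38.8, 73.6, 73.9, 83.0, 72.6,
]

def mean_average_precision(per_class_ap):
    """mAP as the unweighted mean over classes."""
    return sum(per_class_ap) / len(per_class_ap)

map_value = mean_average_precision(faster_rcnn_ap)
print(f"mAP = {map_value:.3f}")  # within rounding of the tabulated 73.2
```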

Setting	Number of anchors	Detection speed/fps	mAP/%
Anchor-6	45 590	17	79.1
Anchor-4	30 256	24	78.1
Anchor-4 and Anchor-6	31 252	23	79.0
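
The trade-off in the table above can be restated as per-image latency and relative anchor count. The sketch assumes the detection speed is given in frames per second, as in the abstract; the data are copied from the table and the helper names are hypothetical.

```python
# (number of anchors, detection speed in fps, mAP in %) per setting, from the table.
settings = {
    "Anchor-6": (45590, 17, 79.1),
    "Anchor-4": (30256, 24, 78.1),
    "Anchor-4 and Anchor-6": (31252, 23, 79.0),
}

def latency_ms(fps):
    """Per-image latency in milliseconds for a given frame rate."""
    return 1000.0 / fps

def anchor_reduction(anchors, baseline):
    """Fraction of anchors removed relative to a baseline setting."""
    return 1.0 - anchors / baseline

baseline_anchors = settings["Anchor-6"][0]
for name, (anchors, fps, m_ap) in settings.items():
    print(
        f"{name}: {latency_ms(fps):.1f} ms/image, "
        f"{anchor_reduction(anchors, baseline_anchors):.1%} fewer anchors, mAP {m_ap}"
    )
```

Read this way, the mixed setting removes roughly a third of the anchors of Anchor-6 while giving up 0.1 points of mAP.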

Setting	Number of anchors	Detection speed/fps	mAP/%
Feature-7	31 252	23	79.0
Feature-6	31 262	23	78.8
Feature-5	31 192	23	78.4
Feature-4	31 042	24	76.4
