• SYSTEMS ENGINEERING •

### Choice of discount rate in reinforcement learning with long-delay rewards

Xiangyang LIN*, Qinghua XING, Fuxian LIU

1. Department of Air Defense and Anti-Missile, Air Force Engineering University, Xi’an 710051, China
• Received: 2020-12-17 Accepted: 2022-03-21 Online: 2022-05-06 Published: 2022-05-06
• Contact: Xiangyang LIN E-mail: 95014052@qq.com; qh_xing@126.com; liuxqh@126.com
• About the authors:
LIN Xiangyang was born in 1994. He received his B.S. and M.S. degrees from Air Force Engineering University, Xi’an, in 2017 and 2019, respectively, where he is currently a Ph.D. student. His research interests include reinforcement learning and intelligent decision-making. E-mail: 95014052@qq.com
XING Qinghua was born in 1966. She received her B.S. degree from Shanxi University, Shanxi, China, in 1989, and M.S. and Ph.D. degrees from Air Force Engineering University, Xi’an, in 1992 and 2003, respectively, where she is currently a professor. Her research interests include system simulation modeling, combat decision analysis, computer vision, and military system decision-making. E-mail: qh_xing@126.com
LIU Fuxian was born in 1962. He received his B.S. degree from Lanzhou University, Lanzhou, China, in 1994, and M.S. and Ph.D. degrees from Air Force Engineering University, Xi’an, in 1998 and 2001, respectively, where he is currently a professor. His research interests include deep learning and military system decision-making. E-mail: liuxqh@126.com
• Supported by:
This work was supported by the National Natural Science Foundation of China (71771216; 71701209; 72001214).

Abstract:

In the real world, most successes are the result of long-term effort. The reward of success is extremely high, but attaining it requires a long process of early investment. People who are “myopic” value only short-term rewards and are unwilling to make early-stage investments, so they rarely achieve ultimate success and its correspondingly high reward. Similarly, for a reinforcement learning (RL) model with long-delay rewards, the discount rate determines the strength of the agent’s “farsightedness”. To enable the trained agent to make a chain of correct choices and finally succeed, this paper first derives the feasible region of the discount rate that satisfies the agent’s “farsightedness” requirement. Then, to avoid the complication of solving implicit equations when choosing a feasible solution, a simple method is explored and verified by theoretical demonstration and mathematical experiments. A series of RL experiments is then designed and implemented to verify the validity of the theory. Finally, the model is extended from the finite-horizon process to the infinite-horizon process, and the validity of the extended model is verified by theory and experiments. This research not only reveals the significance of the discount rate, but also provides a theoretical basis as well as a practical method for choosing the discount rate in future research.
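The central idea of the abstract, that the discount rate controls whether an agent prefers a long-delayed large reward over steady small ones, can be sketched with a toy calculation. The reward sequences and numbers below are made-up assumptions for illustration only, not from the paper; the discounted return formula G = Σ γ^t r_t is the standard RL definition.

```python
# Illustrative sketch (not the paper's method): how the discount rate gamma
# decides between a "myopic" policy (small reward every step) and a
# "farsighted" policy (nothing until step T, then one large reward).
# All reward values and the delay T are assumed numbers for demonstration.

def discounted_return(rewards, gamma):
    """Discounted return G = sum over t of gamma**t * r_t for a finite sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

T = 20                                   # assumed delay before the big reward
myopic = [1.0] * T                       # reward of 1 at every step
farsighted = [0.0] * (T - 1) + [50.0]    # zero rewards, then 50 at step T

for gamma in (0.5, 0.9, 0.99):
    g_myopic = discounted_return(myopic, gamma)
    g_far = discounted_return(farsighted, gamma)
    winner = "farsighted" if g_far > g_myopic else "myopic"
    print(f"gamma={gamma}: myopic={g_myopic:.2f}, "
          f"farsighted={g_far:.2f} -> prefers {winner}")
```

With these numbers, small gamma values make the delayed reward nearly worthless, while gamma close to 1 makes the farsighted policy dominate; the paper's contribution is deriving, rather than guessing, the feasible region of gamma that guarantees the farsighted choice.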