• SYSTEMS ENGINEERING •

### Choice of discount rate in reinforcement learning with long-delay rewards

Xiangyang LIN*, Qinghua XING, Fuxian LIU

1. Department of Air Defense and Anti-Missile, Air Force Engineering University, Xi’an 710051, China
• Received: 2020-12-17 Accepted: 2022-03-21 Online: 2022-05-06 Published: 2022-05-06
• Contact: Xiangyang LIN E-mail: 95014052@qq.com; qh_xing@126.com; liuxqh@126.com
• About the authors:
LIN Xiangyang was born in 1994. He received his B.S. and M.S. degrees from Air Force Engineering University, Xi’an, in 2017 and 2019, respectively, where he is currently a Ph.D. student. His research interests include reinforcement learning and intelligent decision-making. E-mail: 95014052@qq.com
XING Qinghua was born in 1966. She received her B.S. degree from Shanxi University, Shanxi, China, in 1989, and M.S. and Ph.D. degrees from Air Force Engineering University, Xi’an, in 1992 and 2003, respectively, where she is currently a professor. Her research interests include system simulation modeling, combat decision analysis, computer vision, and military system decision-making. E-mail: qh_xing@126.com
LIU Fuxian was born in 1962. He received his B.S. degree from Lanzhou University, Lanzhou, China, in 1994, and M.S. and Ph.D. degrees from Air Force Engineering University, Xi’an, in 1998 and 2001, respectively, where he is currently a professor. His research interests include deep learning and military system decision-making. E-mail: liuxqh@126.com
• Supported by:
This work was supported by the National Natural Science Foundation of China (71771216; 71701209; 72001214).

Abstract:

In the real world, most successes are the result of long-term effort. The reward of success is extremely high, but attaining it requires a long process of early investment. People who are “myopic” value only short-term rewards and are unwilling to make early-stage investments, so they rarely achieve ultimate success and its correspondingly high reward. Similarly, for a reinforcement learning (RL) model with long-delay rewards, the discount rate determines the strength of the agent’s “farsightedness”. To enable the trained agent to make a chain of correct choices and finally succeed, this paper first derives the feasible region of the discount rate that satisfies the agent’s “farsightedness” requirement. Then, to avoid the complication of solving implicit equations when choosing a feasible solution, a simple method is explored and verified by theoretical demonstration and mathematical experiments. A series of RL experiments is then designed and implemented to verify the validity of the theory. Finally, the model is extended from the finite-horizon process to the infinite-horizon process, and the validity of the extended model is verified by theory and experiments. This research not only reveals the significance of the discount rate, but also provides a theoretical basis as well as a practical method for choosing the discount rate in future research.
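The central idea of the abstract, that the discount rate controls whether an agent prefers a long-delayed large reward over steady small ones, can be sketched with a toy calculation. The reward sequences and numbers below are made-up assumptions for illustration only, not from the paper; the discounted return formula G = Σ γ^t r_t is the standard RL definition.

```python
# Illustrative sketch (not the paper's method): how the discount rate gamma
# decides between a "myopic" policy (small reward every step) and a
# "farsighted" policy (nothing until step T, then one large reward).
# All reward values and the delay T are assumed numbers for demonstration.

def discounted_return(rewards, gamma):
    """Discounted return G = sum over t of gamma**t * r_t for a finite sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

T = 20                                   # assumed delay before the big reward
myopic = [1.0] * T                       # reward of 1 at every step
farsighted = [0.0] * (T - 1) + [50.0]    # zero rewards, then 50 at step T

for gamma in (0.5, 0.9, 0.99):
    g_myopic = discounted_return(myopic, gamma)
    g_far = discounted_return(farsighted, gamma)
    winner = "farsighted" if g_far > g_myopic else "myopic"
    print(f"gamma={gamma}: myopic={g_myopic:.2f}, "
          f"farsighted={g_far:.2f} -> prefers {winner}")
```

With these numbers, small gamma values make the delayed reward nearly worthless, while gamma close to 1 makes the farsighted policy dominate; the paper's contribution is deriving, rather than guessing, the feasible region of gamma that guarantees the farsighted choice.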