• CONTROL THEORY AND APPLICATION •

### Knowledge transfer in multi-agent reinforcement learning with incremental number of agents

Wenzhang LIU1, Lu DONG2, Jian LIU1, Changyin SUN1,*

1 School of Automation, Southeast University, Nanjing 210096, China
2 School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
• Received: 2021-08-13 Accepted: 2022-03-07 Online: 2022-05-06 Published: 2022-05-06
• Contact: Changyin SUN E-mail: wzliu@seu.edu.cn; ldong90@seu.edu.cn; bkliujian@163.com; cysun@seu.edu.cn
• About author:

LIU Wenzhang was born in 1993. He is a Ph.D. student in the School of Automation, Southeast University, Nanjing, China. He received his B.S. degree in engineering from Jilin University, Changchun, China, in 2016. He is currently working toward his Ph.D. degree in control science and engineering at Southeast University. His research interests include machine learning, deep reinforcement learning, optimal control, and multi-agent cooperative control. E-mail: wzliu@seu.edu.cn

DONG Lu was born in 1990. She received her B.S. degree in physics and Ph.D. degree in electrical engineering from Southeast University, Nanjing, China, in 2012 and 2017, respectively. She is currently an associate professor with the School of Cyber Science and Engineering, Southeast University. Her current research interests include adaptive dynamic programming, event-triggered control, nonlinear system control and optimization. E-mail: ldong90@seu.edu.cn

LIU Jian was born in 1992. He received his B.S. and Ph.D. degrees from the School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing, China, in 2015 and 2020, respectively. From September 2017 to September 2018, he was a joint training student with the Department of Mathematics, Dartmouth College, Hanover, NH, USA. From 2020 to 2021, he was a postdoctoral fellow with the School of Automation, Southeast University, Nanjing, China, where he is currently an associate professor. His current research interests include multi-agent systems, nonlinear control, event-triggered control, and fixed-time control. E-mail: bkliujian@163.com

SUN Changyin was born in 1975. He received his B.S. degree in applied mathematics from the College of Mathematics, Sichuan University, Chengdu, China, in 1996, and M.S. and Ph.D. degrees in electrical engineering from Southeast University, Nanjing, China, in 2001 and 2004, respectively. He is currently a professor with the School of Automation, Southeast University, Nanjing, China. His current research interests include intelligent control, flight control, and optimal theory. He is an associate editor of the IEEE Transactions on Neural Networks and Learning Systems, Neural Processing Letters, and the IEEE/CAA Journal of Automatica Sinica. E-mail: cysun@seu.edu.cn
• Supported by:
This work was supported by the National Key R&D Program of China (2018AAA0101400), the National Natural Science Foundation of China (62173251; 61921004; U1713209), the Natural Science Foundation of Jiangsu Province of China (BK20202006), and the Guangdong Provincial Key Laboratory of Intelligent Decision and Cooperative Control.

Abstract:

This paper studies reinforcement learning for cooperative multi-agent systems (MAS) in which the number of agents grows incrementally. Existing multi-agent reinforcement learning approaches handle MAS with a fixed number of agents and can learn well-performing policies. However, when the number of agents increases, the previously learned policies may not perform well in the new scenario. The new agents must learn from scratch to find optimal policies together with the others, which may slow down the learning of the whole team. To solve this problem, we propose a new algorithm that takes full advantage of the previously learned knowledge and transfers it from the previous agents to the new agents. Since the previous agents have been trained well in the source environment, they are treated as teacher agents in the target environment. Correspondingly, the new agents are called student agents. To enable the student agents to learn from the teacher agents, we first modify the input nodes of the teacher agents' networks to adapt them to the current environment. Then, the teacher agents take the observations of the student agents as input and output advised actions and values as supervising information. Finally, the student agents combine the reward from the environment with the supervising information from the teacher agents and learn the optimal policies with modified loss functions. By taking full advantage of the teacher agents' knowledge, the search space for the student agents is reduced significantly, which accelerates the learning of the whole system. The proposed algorithm is verified in several multi-agent simulation environments, and the experimental results demonstrate its efficiency.
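
The teacher-student scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the zero-padding of the teacher's input layer, the form of the combined loss, the function names `expand_input_weights` and `student_loss`, and the weight `beta` are all assumptions made for illustration, using plain NumPy vectors of action values.

```python
import numpy as np


def expand_input_weights(w_old, new_input_dim):
    """Pad a first-layer weight matrix (hidden x old_in) with zero columns.

    Assumption: new observation entries are appended after the old ones,
    so the teacher's output on the old entries is unchanged.
    """
    hidden, old_in = w_old.shape
    if new_input_dim < old_in:
        raise ValueError("new_input_dim must be >= the old input size")
    w_new = np.zeros((hidden, new_input_dim), dtype=w_old.dtype)
    w_new[:, :old_in] = w_old
    return w_new


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def student_loss(q_student, action, reward, q_next_student, teacher_q,
                 gamma=0.99, beta=0.5):
    """Hypothetical student loss: TD error plus teacher supervision.

    q_student, q_next_student, teacher_q are 1-D arrays of action values.
    beta (an assumed hyperparameter) weights the supervising terms.
    """
    # Environment term: squared one-step temporal-difference error.
    td_target = reward + gamma * q_next_student.max()
    td_loss = (q_student[action] - td_target) ** 2

    # Supervising term 1: cross-entropy toward the teacher's advised action.
    advised = int(np.argmax(teacher_q))
    action_loss = -np.log(softmax(q_student)[advised])

    # Supervising term 2: squared error toward the teacher's advised value.
    value_loss = (q_student[advised] - teacher_q[advised]) ** 2

    return td_loss + beta * (action_loss + value_loss)
```

Setting `beta` to zero in this sketch recovers ordinary temporal-difference learning from the environment reward alone, while larger values pull the student more strongly toward the teacher's advice.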