
Journal of Systems Engineering and Electronics ›› 2025, Vol. 36 ›› Issue (5): 1353-1373. doi: 10.23919/JSEE.2025.000115
• CONTROL THEORY AND APPLICATION •
Rui ZHOU1,2, Weichao ZHONG3, Wenlong LI3, Hao ZHANG1,2,*
Received: 2024-09-18
Online: 2025-10-18
Published: 2025-10-24
Contact: Hao ZHANG
E-mail: zhourui221@mails.ucas.ac.cn; zhongweichaohit@gmail.com; liwenlongzacao@126.com; hao.zhang.zhr@gmail.com
Cite this article: Rui ZHOU, Weichao ZHONG, Wenlong LI, Hao ZHANG. Self-play training and analysis for GEO inspection game with modular actions[J]. Journal of Systems Engineering and Electronics, 2025, 36(5): 1353-1373.
Table 6  Range of modules' parameters

| Module | Parameter | Range |
| --- | --- | --- |
| Hop | | [5 (away from V-bar), 10 (toward V-bar)] |
| | | [−35, 35] |
| ToVbar | | [1, 5] |
| | | [−12, 12] |
| Glideslope | | [6 (away from V-bar), 12 (toward V-bar)] |
| | | [−40, 40] |
| NMC | | [−5, 5] |
| | | [−10, 10] |
| | | [−25, 25] |
| Teardrop | | [4, 8] |
| | | [0, 7 (toward V-bar)] |
| | | [−12, 12] |
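Table 6 describes a discrete–continuous hybrid (parameterized) action space: the agent first selects one of five maneuver modules, then chooses that module's continuous parameters within the listed bounds. Below is a minimal Python sketch of this structure with uniform sampling for illustration only; the dictionary name and the flat (low, high) encoding are assumptions of this sketch, and the parameter symbols from the table are not reproduced here.

```python
# A minimal sketch of the modular (parameterized) action space implied by
# Table 6: a discrete module choice plus module-specific continuous
# parameters. The (low, high) encoding and names are illustrative.
import numpy as np

# Module name -> list of (low, high) bounds, one pair per continuous parameter.
MODULE_PARAM_RANGES = {
    "Hop":        [(5.0, 10.0), (-35.0, 35.0)],
    "ToVbar":     [(1.0, 5.0), (-12.0, 12.0)],
    "Glideslope": [(6.0, 12.0), (-40.0, 40.0)],
    "NMC":        [(-5.0, 5.0), (-10.0, 10.0), (-25.0, 25.0)],
    "Teardrop":   [(4.0, 8.0), (0.0, 7.0), (-12.0, 12.0)],
}

def sample_action(rng: np.random.Generator):
    """Uniformly sample a (module, parameters) pair from the hybrid space."""
    module = str(rng.choice(list(MODULE_PARAM_RANGES)))
    bounds = MODULE_PARAM_RANGES[module]
    params = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
    return module, params

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(sample_action(rng))  # e.g., ('NMC', array([...]))
```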
Table 4  Training hyperparameters for MPDQN

| Parameter | Value |
| --- | --- |
| Learning rate of discrete action network | |
| Learning rate of continuous parameter network | |
| Update rate of discrete action network | 0.005 |
| Update rate of continuous parameter network | 0.001 |
| Discount factor | 0.9 |
| Gradient clipping | 1 |
| Batch size | 256 |
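For reference, Table 4 maps naturally onto a small configuration object. Below is a minimal sketch assuming a dataclass layout; the field names are invented for illustration, and the two learning rates, which are blank in the table, are left without defaults so they must be supplied by the user rather than read off this sketch.

```python
# A sketch of the MPDQN training configuration from Table 4.
# Field names are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class MPDQNConfig:
    # The two learning-rate values are blank in Table 4, so no defaults are given.
    lr_discrete: float           # learning rate of discrete action network
    lr_param: float              # learning rate of continuous parameter network
    tau_discrete: float = 0.005  # (soft) update rate of discrete action network
    tau_param: float = 0.001     # (soft) update rate of continuous parameter network
    gamma: float = 0.9           # discount factor
    grad_clip: float = 1.0       # gradient clipping threshold
    batch_size: int = 256        # replay minibatch size

# Example (learning rates chosen by the user, not taken from the paper):
# cfg = MPDQNConfig(lr_discrete=1e-3, lr_param=1e-4)
```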