Application of self-play reinforcement learning and explainable decision tree in intelligent air combat

doi:10.23919/JSEE.2026.000079

Journal of Systems Engineering and Electronics, 2026, Vol. 37, Issue 2: 616-635. doi: 10.23919/JSEE.2026.000079

• SYSTEMS ENGINEERING •

Jingbo WANG¹,², Liaoyuan ZHU², Shaojie XIA², Huibin LIU², Jing LIU², Chongxiao QU²,*, Zhihuan SONG¹

¹College of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China
²The 52nd Research Institute of China Electronics Technology Group Corporation, Hangzhou 311100, China

Received: 2024-01-08 Online: 2026-04-18 Published: 2026-04-30
Contact: Chongxiao QU. E-mail: wangjingbo2@cetc.com.cn; zhuliaoyuan1@cetc.com.cn; xiashaojie@cetc.com.cn; liuhuibin@cetc.com.cn; liujing2@cetc.com.cn; quchongxiao@cetc.com.cn; songzhihuan@zju.edu.cn
About author:
WANG Jingbo was born in 1995. He received his Ph.D. degree in control science and engineering from Zhejiang University, Hangzhou, China, in 2022. He is currently an engineer at the 52nd Research Institute of China Electronics Technology Group Corporation and a Postdoctoral Research Fellow with the College of Control Science and Engineering, Zhejiang University. His research interests include intelligent air combat and deep reinforcement learning. E-mail: wangjingbo2@cetc.com.cn

ZHU Liaoyuan was born in 1992. He received his M.Eng. degree in control engineering from Zhejiang University, Hangzhou, China, in 2017. He is currently an engineer at the 52nd Research Institute of China Electronics Technology Group Corporation. His research interests include intelligent air combat and deep reinforcement learning. E-mail: zhuliaoyuan1@cetc.com.cn

XIA Shaojie was born in 1989. He received his M.S. degree in control science and engineering from Zhejiang University of Technology, Hangzhou, China, in 2015. He is currently a senior engineer in the 52nd Research Institute of China Electronics Technology Group Corporation. His research interests include autonomous air combat, intelligent decision, and deep reinforcement learning. E-mail: xiashaojie@cetc.com.cn

LIU Huibin was born in 1992. He received his M.Eng. degree in computer technology from Beijing University of Posts and Telecommunications, Beijing, China, in 2021. He is currently an assistant engineer in the 52nd Research Institute of China Electronics Technology Group Corporation. His research interests include air combat simulation and deep reinforcement learning. E-mail: liuhuibin@cetc.com.cn

LIU Jing was born in 1996. She received her M.S. degree in control science and engineering from Harbin Engineering University, Harbin, China, in 2022. She is currently an assistant engineer at the 52nd Research Institute of China Electronics Technology Group Corporation. Her research interests include air combat simulation and deep reinforcement learning. E-mail: liujing2@cetc.com.cn

QU Chongxiao was born in 1985. He received his M.S. degree in business administration from Zhejiang University in 2014. He is currently a Ph.D. candidate in electronic and information engineering at the University of Electronic Science and Technology of China and a professor-level senior engineer at the 52nd Research Institute of China Electronics Technology Group Corporation. His research interests include autonomous air combat, intelligent decision, and deep reinforcement learning. E-mail: quchongxiao@cetc.com.cn

SONG Zhihuan was born in 1962. He received his B.Eng. and M.Eng. degrees in industrial automation from Hefei University of Technology, Hefei, China, in 1983 and 1986, respectively, and his Ph.D. degree in industrial automation from Zhejiang University, Hangzhou, China, in 1997. Since 1997, he has been with the College of Control Science and Engineering, Zhejiang University, where he is currently a professor. His research interests include intelligent decision, machine learning, reinforcement learning, analytics and applications of big data, and advanced process control technologies. E-mail: songzhihuan@zju.edu.cn
Supported by:
This work was supported by the Joint Funds of the National Natural Science Foundation of China (U2341216).

Abstract

Deep reinforcement learning algorithms are revolutionizing intelligent decision-making in air combat, drawing widespread attention and extensive research. However, air combat agents trained with these algorithms face significant challenges, such as limited decision-making capacities due to adversarial training against relatively fixed and singular expert strategies, and a lack of interpretability and reliability in their decision-making processes. To tackle these issues, this paper proposes a self-play training mechanism based on policy switching and opponent selection, allowing air combat agents to refine their capabilities by engaging with previous versions of themselves. Additionally, an explainable decision tree model is developed to clarify the decision logic of these agents. Simulation results demonstrate that the proposed self-play training approach significantly enhances the decision-making abilities of air combat agents, with late-stage agents showing a 38% improvement over early-stage agents in confrontations with an expert strategy. Moreover, the explainable decision tree model effectively elucidates the decision logic and achieves an 86% win rate against the expert strategy, comparable to the 88% win rate of the air combat agents.

Key words: deep reinforcement learning, intelligent air combat, self-play training, explainable decision tree

Jingbo WANG, Liaoyuan ZHU, Shaojie XIA, Huibin LIU, Jing LIU, Chongxiao QU, Zhihuan SONG. Application of self-play reinforcement learning and explainable decision tree in intelligent air combat[J]. Journal of Systems Engineering and Electronics, 2026, 37(2): 616-635.

Action	Component	Description
Maneuver	CE	Get the enemy’s position and accelerate towards it
Maneuver	LTF	Get the positions of the own and enemy aircraft, calculate the azimuth angle of the enemy relative to the own aircraft, and then accelerate in the direction 90° to the left of the azimuth angle
Maneuver	RTF	Get the positions of the own and enemy aircraft, calculate the azimuth angle of the enemy relative to the own aircraft, and then accelerate in the direction 90° to the right of the azimuth angle
Maneuver	MG	The aircraft gets the enemy’s position and flies towards it at a constant velocity. Before the launched missile opens its radar, the aircraft shares the enemy position information with the missile. Once the missile opens its radar, the aircraft no longer provides the enemy position information
Maneuver	E	Get the position of the missile attacking the aircraft, calculate the azimuth angle of the missile relative to the aircraft, then accelerate in the direction 180° opposite to the azimuth angle while simultaneously decreasing to evade the missile attack
Radar	C	Close the aircraft’s radar
Radar	O	Open the aircraft’s radar
Missile	L	When the conditions for missile launch are met (radar has locked onto the enemy, the missile is available, the enemy is within the maximum launch angle of the missile, the enemy is within the missile attack range relative to the own aircraft), initiate missile launch to attack the enemy
Missile	W	Do not initiate missile launch to attack the enemy
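
The maneuver components above are all defined relative to an azimuth angle, so their commanded headings can be sketched in a few lines. This is a minimal planar illustration, not the authors' implementation: the function names, the 2D simplification, and the convention that azimuth is measured counterclockwise from the +x axis (so "left" is +90°) are all assumptions.

```python
import math

# Heading offsets relative to the azimuth of the reference object.
# For CE, LTF, RTF and MG the reference is the enemy aircraft;
# for E it is the attacking missile. Offsets follow the table above,
# under the ASSUMED counterclockwise-positive angle convention.
HEADING_OFFSET_DEG = {
    "CE": 0.0,     # fly towards the enemy
    "LTF": 90.0,   # 90 degrees to the left of the azimuth
    "RTF": -90.0,  # 90 degrees to the right of the azimuth
    "MG": 0.0,     # fly towards the enemy (at constant velocity)
    "E": 180.0,    # fly directly away from the missile
}


def azimuth_deg(own_xy, target_xy):
    """Azimuth of the target as seen from own position, in [0, 360)."""
    dx = target_xy[0] - own_xy[0]
    dy = target_xy[1] - own_xy[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0


def maneuver_heading(component, own_xy, reference_xy):
    """Commanded heading in degrees for a maneuver component (hypothetical helper)."""
    return (azimuth_deg(own_xy, reference_xy) + HEADING_OFFSET_DEG[component]) % 360.0
```

For example, with the own aircraft at the origin and the enemy due east along +x, `maneuver_heading("LTF", (0.0, 0.0), (1000.0, 0.0))` points north under this convention. Altitude changes and speed control are deliberately left out of the sketch.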

Side	Initial angle/(°)	Initial velocity/(m/s)	Initial position x-axis/m	Initial position y-axis/m	Initial position z-axis/m
Red	[0,360]	[260,290]	[−100000,100000]	[−100000,100000]	[10000,15000]
Blue	[0,360]	[260,290]	Position is created based on the position of the red aircraft

Policy	V₃₀₀	V₂₅₀	V₂₀₀	V₁₅₀	V₁₀₀	V₅₀	V₁₀
V₃₀₀	−	1	0.016	0.016	0.016	0.016	0.016
V₂₅₀	1	−	0.016	0.016	0.016	0.016	0.016
V₂₀₀	0.016	0.016	−	0.016	0.016	0.016	0.016
V₁₅₀	0.016	0.016	0.016	−	0.016	0.016	0.016
V₁₀₀	0.016	0.016	0.016	0.016	−	0.016	0.016
V₅₀	0.016	0.016	0.016	0.016	0.016	−	0.016
V₁₀	0.016	0.016	0.016	0.016	0.016	0.016	−

Agent policy under evaluation	Policy V₂₅₀ trained by PS-PSP	Policy V₂₅₀ trained by PSP	Policy V₂₅₀ trained by PS-PSP
Opponent policy	Expert strategy	Expert strategy	Policy V₂₅₀ trained by PSP
Results (win: draw: lose)	88:8:4	67:17:16	60:17:23
Win rate of agent policy under evaluation	88%	67%	60%
Undefeated rate of agent policy under evaluation	96%	84%	77%
Lose rate of agent policy under evaluation	4%	16%	23%
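
The three rates in this table follow directly from the win:draw:lose counts. The sketch below assumes, consistently with the numbers shown, that the undefeated rate counts wins plus draws; the function name is hypothetical.

```python
def outcome_rates(wins, draws, losses):
    """Return win, undefeated and lose rates in percent.

    ASSUMPTION: 'undefeated' means win or draw, which matches the
    table (e.g. 88 wins + 8 draws out of 100 gives 96%).
    """
    total = wins + draws + losses
    return {
        "win_rate": 100.0 * wins / total,
        "undefeated_rate": 100.0 * (wins + draws) / total,
        "lose_rate": 100.0 * losses / total,
    }


# The three result columns of the table above.
for result in [(88, 8, 4), (67, 17, 16), (60, 17, 23)]:
    print(result, outcome_rates(*result))
```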

Experiment	Training time/s	Mean reward	Mean steps per episode	Win rate of policy V₂₅₀ against the expert strategy/%	Undefeated rate of policy V₂₅₀ against the expert strategy/%	Lose rate of policy V₂₅₀ against the expert strategy/%
Complete information	57933.6	347.9	245	90	95	5
Incomplete information with previous data or initial state	58205.5	472.7	344	88	96	4
Incomplete information padding 0	57865.3	350.6	380	76	85	15
Incomplete information padding −1	59680.2	385.4	395	80	90	10
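
The three incomplete-information settings differ only in what is substituted for enemy-related observations when the enemy cannot be observed. A minimal sketch of that substitution follows; the function name, the mode names, and the idea that the caller seeds `last_known` with the initial state are assumptions made for illustration, not details taken from the paper.

```python
def fill_enemy_features(current, visible, last_known, mode="previous"):
    """Return the enemy feature vector fed to the agent (hypothetical helper).

    current:    enemy features observed this step (ignored when not visible)
    visible:    True if the enemy is currently observed
    last_known: most recent observed enemy features; ASSUMED to be seeded
                with the initial state before the first observation
    mode:       "previous"  -> reuse previous data or initial state
                "zero"      -> pad with 0
                "minus_one" -> pad with -1
    """
    if visible:
        return list(current)
    if mode == "previous":
        return list(last_known)
    if mode == "zero":
        return [0.0] * len(last_known)
    if mode == "minus_one":
        return [-1.0] * len(last_known)
    raise ValueError(f"unknown padding mode: {mode}")
```

One plausible reading of the table is that reusing the last known values keeps the input in a familiar range, whereas constant padding injects values the policy must learn to treat as "unknown"; the sketch does not test that interpretation.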

Index	Feature name	Feature description
1	can_attack_status	1 if the aircraft has attack capability; otherwise 0
2	detect_radar_status	1 if the aircraft has opened its detection radar; otherwise 0
3	is_attacked_status	1 if the aircraft is under attack; otherwise 0
4	is_visible_status	1 if the enemy aircraft is visible; otherwise 0
5	lock_target_status	1 if the aircraft has locked onto the target; otherwise 0
6	meet_attack_condition	1 if the attack condition is met; otherwise 0
7	missile_fire_status	1 if a missile has been launched; otherwise 0
8	open_radar_status_missile	1 if the enemy missile has opened its radar; otherwise 0
9	relative_angle_missile	Relative angle between the own aircraft and the enemy missile
10	relative_angle_plane	Relative angle between the own aircraft and the enemy aircraft
11	relative_distance_missile	Relative distance between the own aircraft and the enemy missile
12	relative_distance_plane	Relative distance between the own aircraft and the enemy aircraft
13	relative_height_missile	Relative height between the own aircraft and the enemy missile
14	relative_height_plane	Relative height between the own aircraft and the enemy aircraft
15	relative_speed_missile	Relative speed between the own aircraft and the enemy missile
16	relative_speed_plane	Relative speed between the own aircraft and the enemy aircraft
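
To illustrate how a decision tree over these 16 named features yields a readable decision path, the sketch below walks a small hand-written tree and records each test it applies. The tree structure, thresholds, and chosen actions are purely hypothetical and are not the tree learned in the paper; only the feature names come from the table above.

```python
FEATURE_NAMES = [
    "can_attack_status", "detect_radar_status", "is_attacked_status",
    "is_visible_status", "lock_target_status", "meet_attack_condition",
    "missile_fire_status", "open_radar_status_missile",
    "relative_angle_missile", "relative_angle_plane",
    "relative_distance_missile", "relative_distance_plane",
    "relative_height_missile", "relative_height_plane",
    "relative_speed_missile", "relative_speed_plane",
]

# HYPOTHETICAL toy tree: internal nodes test one feature against a
# threshold, leaves carry a maneuver label from the action table.
TOY_TREE = {
    "feature": "is_attacked_status", "threshold": 0.5,
    "left": {
        "feature": "missile_fire_status", "threshold": 0.5,
        "left": {"action": "CE"},
        "right": {"action": "MG"},
    },
    "right": {"action": "E"},
}


def to_vector(features):
    """Order a feature dict into a list following the table's index order."""
    return [float(features[name]) for name in FEATURE_NAMES]


def explain(tree, features):
    """Return (action, path), where path lists the tests that led to the leaf."""
    path = []
    node = tree
    while "action" not in node:
        name, threshold = node["feature"], node["threshold"]
        go_right = features[name] > threshold
        path.append(f"{name} {'>' if go_right else '<='} {threshold}")
        node = node["right"] if go_right else node["left"]
    return node["action"], path
```

The returned path is what makes such a model explainable: each action comes with the explicit sequence of feature tests that produced it.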

Label	Maneuver 0(CE)	Maneuver 1(LTF)	Maneuver 2(RTF)	Maneuver 3(MG)	Maneuver 4(E)	Radar 0(C)	Radar 1(O)	Missile 0(W)	Missile 1(L)
Win	147983	226	1241	1649	12968	141733	22334	41097	122970
Draw	11218	59	64	113	1217	11960	711	3273	9398
Lose	2207	126	120	144	1419	2934	1082	2283	1733
Sum	161408	411	1425	1906	15604	156627	24127	46653	134101
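
The label counts are strongly imbalanced, which is easy to quantify from the Sum row. The sketch below computes the share of each label and, as one common option rather than anything stated in the source, inverse-frequency class weights of the form n_samples / (n_classes × count). The variable and function names are hypothetical; the counts are copied from the table.

```python
# Counts taken from the "Sum" row of the table above.
MANEUVER_COUNTS = {"CE": 161408, "LTF": 411, "RTF": 1425, "MG": 1906, "E": 15604}
RADAR_COUNTS = {"C": 156627, "O": 24127}
MISSILE_COUNTS = {"W": 46653, "L": 134101}


def shares(counts):
    """Fraction of samples carrying each label."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * count).

    ASSUMPTION: shown only as a standard remedy for imbalance; the
    source does not say whether any reweighting was used.
    """
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}


for name, counts in [("maneuver", MANEUVER_COUNTS),
                     ("radar", RADAR_COUNTS),
                     ("missile", MISSILE_COUNTS)]:
    print(name, sum(counts.values()), shares(counts))
```

All three action heads sum to the same number of samples, and CE alone accounts for roughly 89% of the maneuver labels.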

Agent policy under evaluation	Opponent policy	Results (win: draw: lose)	Win rate of agent policy under evaluation/%	Undefeated rate of agent policy under evaluation/%	Lose rate of agent policy under evaluation/%
Agent V₂₅₀ trained by PS-PSP	Expert	88:8:4	88	96	4
Explainable model	Expert	86:5:9	86	91	9
