Journal of Systems Engineering and Electronics ›› 2019, Vol. 30 ›› Issue (5): 1035-1043.doi: 10.21629/JSEE.2019.05.19

• Reliability

Fault diagnosis based on dial-test data in datacenter networks

Xiaogang QI1,3, Bingchun WANG1,*, Lifang LIU2,3

  1 School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
  2 School of Computer Science and Technology, Xidian University, Xi'an 710071, China
  3 Xidian-Ningbo Information Technology Institute, Ningbo 315200, China
  • Received: 2018-03-20 Online: 2019-10-08 Published: 2019-10-09
  • Contact: Bingchun WANG E-mail: xgqi@xidian.edu.cn; 1066955863@qq.com; lfliu@xidian.edu.cn
  • About author:
    QI Xiaogang was born in 1973. He is a professor and Ph.D. supervisor in the School of Mathematics and Statistics of Xidian University. He joined the university as a faculty member in 2002, received his Ph.D. degree in applied mathematics from Xidian University in 2005, and became an associate professor in 2006. From September 2012 to August 2013, he was a visiting scholar in the School of Electrical, Computer and Energy Engineering of Arizona State University, Tempe, AZ. His current research interests include system modeling and simulation, resource management and scheduling, performance evaluation and optimization algorithm design, and fault diagnosis in various networks. E-mail: xgqi@xidian.edu.cn
    WANG Bingchun was born in 1997. She received her B.S. degree from Guilin University of Technology, China, in 2016. She is now a graduate student at Xidian University. Her current research interests include data analysis and fault diagnosis of networks. E-mail: 1066955863@qq.com
    LIU Lifang was born in 1972. She received her Ph.D. degree in computer application from Xidian University in 2006, and became an associate professor in the same year. She has been a professor in the School of Computer Science and Technology of Xidian University since 2015. Her current research interests include computer networks, algorithm design and analysis, data processing, and intelligent calculation. E-mail: lfliu@xidian.edu.cn
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61877067; 61572435), the joint fund project of the Ministry of Education-the China Mobile (MCM20170103), Xi'an Science and Technology Innovation Project (201805029YD7CG13-6), and Ningbo Natural Science Foundation (2016A610035; 2017A610119).

Abstract:

The fast growth of datacenter networks, in both scale and structural complexity, has led to an increase in network failures and hence brings new challenges to network management systems. As network failures such as node failures are inevitable, finding fault detection and diagnosis approaches that can effectively restore network communication and reduce the loss due to failure has been recognized as an important research problem in both academia and industry. This research focuses on node failure and presents a proactive fault diagnosis algorithm called heuristic breadth-first detection (HBFD), which locates faulty nodes by dynamically searching the spanning tree, analyzing the dial-test data, and choosing a reasonable threshold. Both theoretical analysis and simulation results demonstrate that HBFD diagnoses node failures effectively, requiring fewer detections and achieving a lower false rate without sacrificing accuracy.

Key words: datacenter network, node failure, proactive fault diagnosis
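
The abstract names three ingredients: a breadth-first search over a spanning tree, dial-test data, and a threshold. The sketch below is a toy illustration of how those pieces could fit together, not the paper's HBFD, whose heuristic the abstract does not specify. The function name `bfs_diagnose`, the data layout (a list of probe outcomes per node), the default threshold of 0.5, and the rule of not searching through a flagged node are all assumptions made for illustration.

```python
from collections import deque


def bfs_diagnose(adjacency, root, dial_tests, threshold=0.5):
    """Toy breadth-first diagnosis (hypothetical; not the paper's HBFD).

    adjacency:  dict mapping node -> iterable of neighbour nodes
    root:       monitoring node, assumed healthy
    dial_tests: dict mapping node -> list of bools (True = probe succeeded)
    threshold:  failure ratio at or above which a node is flagged faulty
    """
    faulty, healthy = set(), set()
    parent = {root: None}   # spanning tree, grown as the search proceeds
    queue = deque([root])
    detections = 0          # number of nodes whose dial-test data was examined

    while queue:
        node = queue.popleft()
        for nbr in adjacency.get(node, ()):
            if nbr in parent or nbr in faulty:
                continue    # already classified
            results = dial_tests.get(nbr, [])
            detections += 1
            # Assumption: a node with no dial-test data is treated as failed.
            ratio = results.count(False) / len(results) if results else 1.0
            if ratio >= threshold:
                faulty.add(nbr)   # flagged; the search does not pass through it
                continue
            parent[nbr] = node
            healthy.add(nbr)
            queue.append(nbr)

    # Nodes reachable only through flagged nodes remain undetermined.
    unreachable = set(adjacency) - healthy - faulty - {root}
    return {"faulty": faulty, "healthy": healthy,
            "unreachable": unreachable, "detections": detections,
            "tree": parent}


# Hypothetical six-node topology with monitor "m"; "d" hangs off "b" only.
adjacency = {
    "m": ["a", "b"], "a": ["m", "c"], "b": ["m", "c", "d"],
    "c": ["a", "b", "e"], "d": ["b"], "e": ["c"],
}
dial_tests = {
    "a": [True, True, True, False],
    "b": [False, False, False, True],
    "c": [True, True, True, True],
    "d": [False, False, False, False],
    "e": [True, True, False, True],
}
result = bfs_diagnose(adjacency, "m", dial_tests)
```

In this made-up example, node "b" is flagged because three of its four probes failed, and node "d" is left undetermined because it can only be reached through "b". A real diagnosis scheme would need a policy for such nodes, which this sketch does not attempt.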