Journal of Systems Engineering and Electronics, 2019, Vol. 30, Issue (6): 1182-1191. doi: 10.21629/JSEE.2019.06.12

• Systems Engineering •

Over-sampling algorithm for imbalanced data classification

Xiaolong XU1,*, Wen CHEN2, Yanfei SUN3

    1 Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
    2 Institute of Big Data Research at Yancheng, Nanjing University of Posts and Telecommunications, Yancheng 224000, China
    3 Office of Scientific R&D, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Received:2018-06-25 Online:2019-12-20 Published:2019-12-25
  • Contact: Xiaolong XU E-mail:xuxl@njupt.edu.cn;1216043012@njupt.edu.cn;sunyanfei@njupt.edu.cn
  • About author:
    XU Xiaolong was born in 1977. He received his B.S. degree in computer and its applications, M.S. degree in computer software and theories, and Ph.D. degree in communications and information systems from Nanjing University of Posts & Telecommunications, Nanjing, China, in 1999, 2002 and 2008, respectively. He worked as a postdoctoral researcher at the Station of Electronic Science and Technology, Nanjing University of Posts & Telecommunications, from 2011 to 2013. He is currently a professor in the College of Computer, Nanjing University of Posts & Telecommunications. He is a senior member of the China Computer Federation. His current research interests include cloud computing and big data, mobile computing, intelligent agents and information security. E-mail: xuxl@njupt.edu.cn
    CHEN Wen was born in 1994. He received his B.E. degree in computer science and technology from Anhui Engineering University, Wuhu, China, in 2016. He works as an engineer at the Institute of Big Data Research at Yancheng, Nanjing University of Posts and Telecommunications, Yancheng, China, carrying out research in data analysis. E-mail: 1216043012@njupt.edu.cn
    SUN Yanfei was born in 1976. He received his Ph.D. degree in communications and information systems from Nanjing University of Posts & Telecommunications, Nanjing, China, in 2006. He is currently a professor and the director of the Science and Technology Department, Nanjing University of Posts & Telecommunications. His current research interests include communication networks, mobile networks and big data. E-mail: sunyanfei@njupt.edu.cn
  • Supported by:
    the National Key Research and Development Program of China (2018YFB1003700), the Scientific and Technological Support Project (Society) of Jiangsu Province (BE2016776), the "333" project of Jiangsu Province (BRA2017228; BRA2017401), and the Talent Project in Six Fields of Jiangsu Province (2015-JNHB-012)

Abstract:

For imbalanced datasets, the focus of classification is to identify samples of the minority class. Current data mining algorithms do not perform well enough on imbalanced datasets. The synthetic minority over-sampling technique (SMOTE) is specifically designed for learning from imbalanced datasets; it generates synthetic minority class examples by interpolating between nearby minority class examples. However, SMOTE suffers from the overgeneralization problem. The density-based spatial clustering of applications with noise (DBSCAN) algorithm is not rigorous when dealing with samples near the borderline. We optimize the DBSCAN algorithm for this problem to make clustering more reasonable. This paper integrates the optimized DBSCAN with SMOTE and proposes a density-based synthetic minority over-sampling technique (DSMOTE). First, the optimized DBSCAN is used to divide the samples of the minority class into three groups: core samples, borderline samples and noise samples. The noise samples of the minority class are then removed so that more effective samples can be synthesized. To make full use of the information in core samples and borderline samples, different strategies are used to over-sample the two groups. Experiments show that DSMOTE achieves better results than SMOTE and Borderline-SMOTE in terms of precision, recall and F-value.
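
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' DSMOTE: the abstract does not specify how DBSCAN is optimized or which over-sampling strategies are applied to each group, so the sketch uses the standard DBSCAN definitions of core, borderline and noise points and an assumed pair of strategies (a core seed interpolates with a random core sample; a borderline seed interpolates toward its nearest core sample). The names `classify_minority`, `interpolate` and `oversample`, and the parameters `eps`, `min_pts` and `n_new`, are hypothetical.

```python
import math
import random


def classify_minority(points, eps, min_pts):
    """Label each minority point 'core', 'border' or 'noise' (plain DBSCAN rules)."""
    n = len(points)
    # The eps-neighbourhood of a point includes the point itself.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbours]
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbours[i]):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels


def interpolate(a, b, rng):
    """SMOTE-style synthetic point on the segment between a and b."""
    gap = rng.random()
    return tuple(x + gap * (y - x) for x, y in zip(a, b))


def oversample(points, eps, min_pts, n_new, seed=0):
    """Drop noise points, then synthesize n_new minority samples."""
    rng = random.Random(seed)
    labels = classify_minority(points, eps, min_pts)
    kept = [(p, lab) for p, lab in zip(points, labels) if lab != "noise"]
    core = [p for p, lab in kept if lab == "core"]
    if not kept:
        return []  # every minority point was classified as noise
    synthetic = []
    for _ in range(n_new):
        seed_point, label = rng.choice(kept)
        if label == "core":
            # Assumed strategy: interpolate with a randomly chosen core sample.
            partner = rng.choice(core)
        else:
            # Assumed strategy: pull a borderline seed toward its nearest core sample.
            partner = min(core, key=lambda c: math.dist(seed_point, c))
        synthetic.append(interpolate(seed_point, partner, rng))
    return synthetic


minority = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
print(classify_minority(minority, eps=1.1, min_pts=3))
# → ['core', 'core', 'core', 'core', 'border', 'noise']
print(oversample(minority, eps=1.1, min_pts=3, n_new=3))
```

In this toy example the isolated point (10, 10) is labelled noise and never seeds a synthetic sample, so no synthetic points are interpolated toward that outlier.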

Key words: imbalanced data, density-based spatial clustering of applications with noise (DBSCAN), synthetic minority over-sampling technique (SMOTE), over-sampling