Over-sampling algorithm for imbalanced data classification

doi:10.21629/JSEE.2019.06.12

Journal of Systems Engineering and Electronics, 2019, Vol. 30, Issue (6): 1182-1191.

Xiaolong XU¹,*, Wen CHEN², Yanfei SUN³

¹ Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
² Institute of Big Data Research at Yancheng, Nanjing University of Posts and Telecommunications, Yancheng 224000, China
³ Office of Scientific R&D, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

Received: 2018-06-25; Online: 2019-12-20; Published: 2019-12-25
Contact: Xiaolong XU. E-mail: xuxl@njupt.edu.cn; 1216043012@njupt.edu.cn; sunyanfei@njupt.edu.cn
About the authors:

XU Xiaolong was born in 1977. He received his B.S. degree in computer and its applications, M.S. degree in computer software and theories, and Ph.D. degree in communications and information systems from Nanjing University of Posts & Telecommunications, Nanjing, China, in 1999, 2002 and 2008, respectively. He worked as a postdoctoral researcher at the Station of Electronic Science and Technology, Nanjing University of Posts & Telecommunications, from 2011 to 2013. He is currently a professor in the College of Computer, Nanjing University of Posts & Telecommunications, and a senior member of the China Computer Federation. His current research interests include cloud computing and big data, mobile computing, intelligent agents and information security. E-mail: xuxl@njupt.edu.cn

CHEN Wen was born in 1994. He received his B.E. degree in computer science and technology from Anhui Engineering University, Wuhu, China, in 2016. He works as an engineer at the Institute of Big Data Research at Yancheng, Nanjing University of Posts and Telecommunications, Yancheng, China, carrying out research in data analysis. E-mail: 1216043012@njupt.edu.cn

SUN Yanfei was born in 1976. He received his Ph.D. degree in communications and information systems from Nanjing University of Posts & Telecommunications, Nanjing, China, in 2006. He is currently a professor and the director of the Science and Technology Department, Nanjing University of Posts & Telecommunications. His current research interests include communication networks, mobile networks and big data. E-mail: sunyanfei@njupt.edu.cn
Supported by: the National Key Research and Development Program of China (2018YFB1003700), the Scientific and Technological Support Project (Society) of Jiangsu Province (BE2016776), the "333" project of Jiangsu Province (BRA2017228; BRA2017401), and the Talent Project in Six Fields of Jiangsu Province (2015-JNHB-012).

Abstract

For imbalanced datasets, the focus of classification is to identify samples of the minority class, and current data mining algorithms do not perform well enough on such datasets. The synthetic minority over-sampling technique (SMOTE) is specifically designed for learning from imbalanced datasets: it generates synthetic minority class examples by interpolating between nearby minority class examples. However, SMOTE suffers from the overgeneralization problem. The density-based spatial clustering of applications with noise (DBSCAN) algorithm is not rigorous when dealing with samples near the borderline, so we optimize DBSCAN to make the clustering more reasonable. This paper integrates the optimized DBSCAN with SMOTE and proposes a density-based synthetic minority over-sampling technique (DSMOTE). First, the optimized DBSCAN divides the minority class samples into three groups: core samples, borderline samples and noise samples. The noise samples of the minority class are then removed so that more effective samples are synthesized. To make full use of the information in the core and borderline samples, different over-sampling strategies are applied to each of the two groups. Experiments show that DSMOTE achieves better results than SMOTE and Borderline-SMOTE in terms of precision, recall and F-value.

Key words: imbalanced data, density-based spatial clustering of applications with noise (DBSCAN), synthetic minority over-sampling technique (SMOTE), over-sampling
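The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that do not come from this page: the grouping uses the textbook DBSCAN definitions of core, border and noise points rather than the authors' optimized variant, the same plain SMOTE interpolation is applied to core and border samples (the paper uses different strategies for the two groups, which are not described here), and the names `group_minority`, `smote_interpolate`, `oversample` and the parameters `eps`, `min_pts`, `k` are hypothetical.

```python
import math
import random


def neighbors_within(points, i, eps):
    """Indices of points (other than i) within distance eps of points[i]."""
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) <= eps]


def group_minority(points, eps, min_pts):
    """Label each minority sample 'core', 'border' or 'noise' (plain DBSCAN rules)."""
    hoods = [neighbors_within(points, i, eps) for i in range(len(points))]
    # A core sample has at least min_pts samples (itself included) within eps.
    is_core = [len(h) + 1 >= min_pts for h in hoods]
    labels = []
    for i, hood in enumerate(hoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in hood):
            labels.append("border")  # not dense itself, but next to a core sample
        else:
            labels.append("noise")
    return labels


def smote_interpolate(x, neighbor, rng):
    """One synthetic sample on the segment between x and a nearby minority sample."""
    gap = rng.random()  # uniform in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(x, neighbor))


def oversample(points, eps, min_pts, n_per_sample, k=3, seed=0):
    """Drop noise samples, then interpolate from every remaining minority sample."""
    rng = random.Random(seed)
    labels = group_minority(points, eps, min_pts)
    kept = [p for p, lab in zip(points, labels) if lab != "noise"]
    synthetic = []
    for i, x in enumerate(kept):
        others = sorted((q for j, q in enumerate(kept) if j != i),
                        key=lambda q: math.dist(x, q))
        nearest = others[:k]  # k nearest kept neighbours (assumed value)
        if not nearest:
            continue
        for _ in range(n_per_sample):
            synthetic.append(smote_interpolate(x, rng.choice(nearest), rng))
    return labels, synthetic


# Toy minority class: a dense square, one sample on its edge, one far outlier.
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.2, 1.0), (10.0, 10.0)]
labels, synthetic = oversample(minority, eps=1.5, min_pts=4, n_per_sample=2)
print(labels)
print(len(synthetic), "synthetic samples")
```

Because the outlier is discarded before interpolation, no synthetic sample is placed on a segment leading to it, which is the effect the abstract attributes to removing minority noise.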

Xiaolong XU, Wen CHEN, Yanfei SUN. Over-sampling algorithm for imbalanced data classification[J]. Journal of Systems Engineering and Electronics, 2019, 30(6): 1182-1191.



Actual label	Predicted positive	Predicted negative
Positive	$TP$	$FN$
Negative	$FP$	$TN$
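The three evaluation measures named in the abstract can be computed from the confusion matrix entries above. The sketch below assumes that the F-value is the harmonic mean of precision and recall (equal weighting); this page does not state the paper's exact definition, although the tabulated values are consistent with it (for example, a precision of 0.606 and a recall of 0.563 give about 0.584). The function name and the example counts are hypothetical.

```python
def precision_recall_f(tp, fn, fp):
    """Precision, recall and F-value for the positive (minority) class."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    # Assumed definition: harmonic mean of precision and recall.
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(tp=8, fn=2, fp=4)
print(f"precision={p:.3f} recall={r:.3f} F-value={f:.3f}")
```

$TN$ does not enter any of the three measures, which is why they suit imbalanced data: a classifier cannot score well simply by labelling everything as the majority class.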

Dataset	Label	#Attr	#Minor	#Major	IL
Pima	1	8	268	500	1.87
Breast-w	4	10	241	458	1.90
Vehicle	0	18	226	946	3.85
Ecoli	1	7	77	336	4.37

Method	$N$/%	Precision	Recall	F-value
Original	-	0.606	0.563	0.584
SMOTE	100	0.565	0.737	0.640
SMOTE	200	0.547	0.768	0.639
SMOTE	300	0.531	0.787	0.634
SMOTE	400	0.533	0.813	0.643
SMOTE	500	0.522	0.809	0.634
DSMOTE	100	0.574	0.737	0.646
DSMOTE	200	0.550	0.795	0.650
DSMOTE	300	0.537	0.815	0.647
DSMOTE	400	0.529	0.843	0.650
DSMOTE	500	0.515	0.856	0.643
Borderline-SMOTE	100	0.545	0.763	0.636
Borderline-SMOTE	200	0.524	0.789	0.629
Borderline-SMOTE	300	0.509	0.791	0.619
Borderline-SMOTE	400	0.513	0.814	0.629
Borderline-SMOTE	500	0.504	0.803	0.643
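The $N$/% column is not defined on this page. A plausible reading, assuming it follows the convention of the original SMOTE algorithm, is that $N$ is the over-sampling rate in percent, so $N = 100$ produces one synthetic sample per minority sample. The sketch below shows the resulting counts under that assumption; the helper name is hypothetical, and 268 is the minority count listed for Pima in the dataset table (the number actually over-sampled in each training fold would be smaller).

```python
def synthetic_count(n_minority, n_percent):
    """Synthetic samples produced at over-sampling rate N (in percent), assumed SMOTE convention."""
    return n_minority * n_percent // 100


for n in (100, 200, 300, 400, 500):
    print(n, synthetic_count(268, n))
```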

Method	$N$/%	Precision	Recall	F-value
Original	-	0.910	0.892	0.901
SMOTE	100	0.906	0.939	0.922
SMOTE	200	0.905	0.946	0.925
SMOTE	300	0.906	0.943	0.924
SMOTE	400	0.906	0.954	0.929
SMOTE	500	0.909	0.959	0.933
DSMOTE	100	0.913	0.953	0.932
DSMOTE	200	0.906	0.954	0.929
DSMOTE	300	0.909	0.953	0.930
DSMOTE	400	0.909	0.954	0.931
DSMOTE	500	0.910	0.963	0.935
Borderline-SMOTE	100	0.907	0.952	0.929
Borderline-SMOTE	200	0.906	0.950	0.927
Borderline-SMOTE	300	0.906	0.948	0.927
Borderline-SMOTE	400	0.905	0.946	0.925
Borderline-SMOTE	500	0.913	0.954	0.933

Method	$N$/%	Precision	Recall	F-value
Original	-	0.874	0.874	0.874
SMOTE	100	0.890	0.894	0.892
SMOTE	200	0.902	0.879	0.891
SMOTE	300	0.881	0.854	0.867
SMOTE	400	0.875	0.844	0.859
SMOTE	500	0.887	0.829	0.857
DSMOTE	100	0.901	0.915	0.908
DSMOTE	200	0.905	0.910	0.907
DSMOTE	300	0.894	0.889	0.892
DSMOTE	400	0.916	0.874	0.895
DSMOTE	500	0.901	0.864	0.882
Borderline-SMOTE	100	0.894	0.894	0.894
Borderline-SMOTE	200	0.890	0.854	0.872
Borderline-SMOTE	300	0.874	0.834	0.853
Borderline-SMOTE	400	0.887	0.864	0.875
Borderline-SMOTE	500	0.861	0.839	0.850

Method	$N$/%	Precision	Recall	F-value
Original	-	0.756	0.766	0.761
SMOTE	100	0.739	0.883	0.805
SMOTE	200	0.734	0.896	0.807
SMOTE	300	0.701	0.883	0.782
SMOTE	400	0.693	0.909	0.787
SMOTE	500	0.697	0.896	0.784
DSMOTE	100	0.737	0.909	0.814
DSMOTE	200	0.723	0.948	0.820
DSMOTE	300	0.711	0.896	0.793
DSMOTE	400	0.699	0.935	0.800
DSMOTE	500	0.706	0.936	0.805
Borderline-SMOTE	100	0.697	0.896	0.784
Borderline-SMOTE	200	0.683	0.922	0.785
Borderline-SMOTE	300	0.660	0.882	0.756
Borderline-SMOTE	400	0.645	0.922	0.759
Borderline-SMOTE	500	0.642	0.909	0.753