Journal of Systems Engineering and Electronics ›› 2009, Vol. 20 ›› Issue (3): 651-659.

• SOFTWARE ALGORITHM AND SIMULATION •

Lazy learner text categorization algorithm based on embedded feature selection

Yan Peng1,2, Zheng Xuefeng1, Zhu Jianyong2 & Xiao Yunhong1   

    1. Information Engineering School, Univ. Science and Technology Beijing, Beijing 100083, P. R. China;
    2. China State Information Center, Beijing 100045, P. R. China
  • Online:2009-06-23 Published:2010-01-03

Abstract:

To avoid the curse of dimensionality, text categorization (TC) algorithms based on machine learning (ML) have to use a feature selection (FS) method to reduce the dimensionality of the feature space. Although widely used, the FS process generally causes information loss, which adversely affects the overall performance of TC algorithms. On the basis of the sparsity of text vectors, a new TC algorithm based on lazy feature selection (LFS) is presented. As a new type of embedded feature selection approach, the LFS method can greatly reduce the dimensionality of the feature space without any information loss, which improves both the efficiency and the performance of the algorithm. Experiments show that the new algorithm simultaneously achieves much higher performance and efficiency than some classical TC algorithms.
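
The abstract does not spell out the LFS procedure, so the following is only an illustrative sketch of the sparsity observation it builds on, not the authors' algorithm. It assumes a k-nearest-neighbour lazy learner with cosine similarity over term-frequency vectors; the function names (`to_sparse`, `cosine_on_query_terms`, `knn_classify`, `build_training`) and the toy corpus are hypothetical. The point it shows is that the dot product needs to visit only the terms that occur in the query document, and that skipping the rest leaves the similarity unchanged.

```python
import math
from collections import Counter


def to_sparse(tokens):
    """Represent a document as a sparse term-frequency dict (term -> count)."""
    return dict(Counter(tokens))


def norm(vec):
    """Euclidean norm computed over the stored (nonzero) entries only."""
    return math.sqrt(sum(v * v for v in vec.values()))


def cosine_on_query_terms(query, doc, doc_norm):
    """Cosine similarity whose dot product visits only the query's terms.

    Terms absent from the query contribute zero to the dot product,
    so skipping them does not change the result.
    """
    q_norm = norm(query)
    if q_norm == 0.0 or doc_norm == 0.0:
        return 0.0
    dot = sum(weight * doc.get(term, 0) for term, weight in query.items())
    return dot / (q_norm * doc_norm)


def build_training(labelled_docs):
    """Precompute sparse vectors and norms for (text, label) pairs."""
    training = []
    for text, label in labelled_docs:
        vec = to_sparse(text.split())
        training.append((vec, norm(vec), label))
    return training


def knn_classify(query_tokens, training, k=3):
    """Label a query by majority vote among its k most similar documents."""
    query = to_sparse(query_tokens)
    scored = [
        (cosine_on_query_terms(query, vec, vec_norm), label)
        for vec, vec_norm, label in training
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, for illustration only.
TRAINING = build_training([
    ("stock market price rise", "finance"),
    ("bank interest rate market", "finance"),
    ("football match goal score", "sport"),
    ("tennis match player score", "sport"),
])
```

Calling `knn_classify("market price bank".split(), TRAINING)` selects the features to compare at query time, per document, rather than pruning the vocabulary once before training. Under these assumptions the cost of each comparison scales with the number of distinct terms in the query rather than with the size of the full vocabulary.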