
Research On Text Categorization Based On Kernel PCA And RBF Neural Network

Posted on: 2010-08-01, Degree: Master, Type: Thesis
Country: China, Candidate: J Yang, Full Text: PDF
GTID: 2178360302459708, Subject: Circuits and Systems
Abstract/Summary:
Text categorization (TC) is an important basis for information retrieval and text mining. The aim of TC is to assign predefined category labels to texts based on their content. Until now, the vast majority of TC work has been done by humans. However, we now live in a society where information is exploding, so traditional manual TC can no longer meet the need, and automatic TC based on artificial intelligence has become an important research field in natural language processing.

First, we discuss the system architecture and key technologies of a TC system, and study and analyze in depth the algorithms used in its sub-modules. Through side-by-side comparison, we analyze the advantages and disadvantages of the various algorithms, especially text representation methods, dimensionality reduction algorithms and TC algorithms.

A neural network (NN) has strong capabilities in learning, associative memory and error tolerance, and it can process data in a high-speed, distributed and parallel way. In addition to these merits, an RBF NN offers faster convergence, global optimization and a simpler network structure.
Therefore, this paper applies an RBF NN to TC, and experiments are carried out with traditional feature selection algorithms and an RBF NN.

We then study in depth two kinds of dimensionality reduction algorithms, feature selection and feature extraction, and point out the theoretical limitation of feature selection: searching for an optimal or even a sub-optimal feature subset is computationally infeasible, whereas constructing an evaluation function that picks features meeting some optimality criterion reduces the computational complexity but no longer guarantees finding an optimal or even a sub-optimal feature subset.

In response to this problem, and to the non-linearity, very high dimensionality and complex correlation between features that characterize text data, we introduce a feature extraction algorithm based on kernel principal component analysis (KPCA) and analyze, both theoretically and in terms of feasibility, the application of KPCA to text dimensionality reduction. NNs have seldom been used in TC, mainly because the dimensionality of the input text space is too high, which restricts their performance. Introducing KPCA remedies this. To this end, this paper proposes an algorithm based on KPCA and an RBF NN. First, the algorithm maps the input space into a high-dimensional feature space to eliminate the non-linearity of the text features. Then, PCA is performed in the feature space to obtain the principal components, which removes the complex correlation between features. Dimensionality reduction is achieved by projecting the input vectors onto the principal component vectors. Finally, the semantic features obtained through dimensionality reduction are used to train an RBF NN. Experiments show that the proposed algorithm effectively reduces the dimensionality of the input space.
In addition, the algorithm improves the classification performance of the RBF NN, which makes the RBF NN suitable for large-scale, real-time TC.
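
The pipeline described in the abstract (kernel mapping, PCA in feature space, projection, then RBF NN training) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the Gaussian kernel, the synthetic two-class "term count" corpus, the parameter values, and the RBF NN training scheme (centers sampled from the training points, output weights fitted by least squares). All function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Gaussian kernel between every row of A and every row of B.
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kpca_fit(X, n_components, gamma):
    # PCA in the kernel-induced feature space: eigendecompose the centered kernel matrix.
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    vals, vecs = np.linalg.eigh(Kc)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]   # keep the largest ones
    alphas = vecs[:, idx] / np.sqrt(vals[idx])    # unit-norm components in feature space
    return {"X": X, "K": K, "alphas": alphas, "gamma": gamma}

def kpca_transform(model, Z):
    # Project documents onto the principal components (the dimensionality reduction step).
    K = model["K"]
    Kz = rbf_kernel(Z, model["X"], model["gamma"])
    Kz_c = Kz - Kz.mean(axis=1, keepdims=True) - K.mean(axis=0)[None, :] + K.mean()
    return Kz_c @ model["alphas"]

def rbfnn_hidden(F, centers, width):
    # Gaussian hidden layer plus a constant bias unit.
    H = rbf_kernel(F, centers, 1.0 / (2.0 * width ** 2))
    return np.hstack([H, np.ones((len(F), 1))])

def rbfnn_fit(F, y, n_classes, n_centers, width, rng):
    # Assumed training scheme: random training points as centers,
    # linear output weights by least squares on one-hot targets.
    centers = F[rng.choice(len(F), n_centers, replace=False)]
    targets = np.eye(n_classes)[y]
    W, *_ = np.linalg.lstsq(rbfnn_hidden(F, centers, width), targets, rcond=None)
    return centers, W

def rbfnn_predict(F, centers, W, width):
    return (rbfnn_hidden(F, centers, width) @ W).argmax(axis=1)

def make_toy_corpus(n_per_class, vocab, rng):
    # Synthetic stand-in for term-count vectors: each class favors one half of the vocabulary.
    half = vocab // 2
    rates0 = np.r_[np.full(half, 3.0), np.full(vocab - half, 0.2)]
    rates1 = rates0[::-1]
    X = np.vstack([rng.poisson(rates0, (n_per_class, vocab)),
                   rng.poisson(rates1, (n_per_class, vocab))]).astype(float)
    y = np.r_[np.zeros(n_per_class, dtype=int), np.ones(n_per_class, dtype=int)]
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # length-normalize each document
    return X, y

def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    X, y = make_toy_corpus(n_per_class=60, vocab=200, rng=rng)
    perm = rng.permutation(len(X))
    train, test = perm[:80], perm[80:]
    model = kpca_fit(X[train], n_components=2, gamma=1.0)
    F_train = kpca_transform(model, X[train])      # 200 dimensions reduced to 2
    F_test = kpca_transform(model, X[test])
    centers, W = rbfnn_fit(F_train, y[train], 2, 10, 0.5, rng)
    pred = rbfnn_predict(F_test, centers, W, 0.5)
    return F_train.shape, float((pred == y[test]).mean())

if __name__ == "__main__":
    print(run_demo())
```

On real text, the rows of X would come from a term-weighting step such as TF-IDF, and the kernel, number of components and RBF width would need tuning; the sketch only shows how the stages fit together.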
Keywords/Search Tags: Automated Text Categorization, Feature Selection, Feature Extraction, Principal Component Analysis, Kernel Principal Component Analysis, RBF Neural Network