Research On Text Categorization Based On Kernel PCA And RBF Neural Network

Posted on:2010-08-01

Degree:Master

Type:Thesis

Country:China

Candidate:J Yang

Full Text:PDF

GTID:2178360302459708

Subject:Circuits and Systems

Abstract/Summary:

Text categorization (TC) is an important basis for information retrieval and text mining. The aim of TC is to assign predefined category labels to texts based on their content. So far, most TC work has been done by humans. However, we now live in a society of exploding information, so traditional manual TC can no longer meet the need, and automatic TC based on artificial intelligence has become an important research field in natural language processing.

Firstly, we discuss the system architecture and key technologies of a TC system. We then study and analyze in depth the algorithms used in the sub-modules of a TC system. Through side-by-side comparison, we analyze the advantages and disadvantages of the various algorithms, especially text representation methods, dimensionality reduction algorithms, and TC algorithms.

A neural network (NN) has strong capabilities in learning, associative memory, and error tolerance. Furthermore, it can process data in a high-speed, distributed, and parallel way. In addition to these merits, the RBF NN is characterized by faster convergence, global optimization, and a simpler network structure. Therefore, this paper applies the RBF NN to TC, and experiments are carried out based on traditional feature selection algorithms and the RBF NN.

Then, we study in depth two kinds of dimensionality reduction algorithms, feature selection and feature extraction, and we point out the theoretical limitation of feature selection: if we search for an optimal or sub-optimal feature subset, the computation is infeasible; but if we instead construct an evaluation function and keep the features that meet a certain optimality criterion in order to reduce the computational complexity, we have no guarantee of finding an optimal or even a sub-optimal feature subset.

In response to this problem, and to the non-linearity, very high dimensionality, and complex correlation between features that characterize text data, we introduce a feature extraction algorithm based on kernel principal component analysis (KPCA), and we analyze in depth the theory and feasibility of applying KPCA to text dimensionality reduction. NNs have seldom been used in the field of TC, mainly because the dimensionality of the input text space is too high, which restricts the performance of the NN. The introduction of KPCA remedies this.

To this end, this paper proposes an algorithm based on KPCA and the RBF NN. First, the algorithm maps the input space into a high-dimensional feature space to eliminate the non-linearity of the text features. Then, PCA is performed in the feature space to obtain the principal components, which removes the complex correlation between features. Dimensionality reduction is achieved by projecting the input vectors onto the principal component vectors. Finally, the semantic features obtained through dimensionality reduction are used to train an RBF NN. Experiments show that the proposed algorithm can effectively reduce the dimensionality of the input space. In addition, the algorithm improves the classification performance of the RBF NN, which makes the RBF NN suitable for large-scale, real-time TC.
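
The pipeline described in the abstract (kernel mapping, PCA in feature space, projection, then RBF NN training) can be illustrated with a minimal NumPy sketch. It is not the thesis's implementation. The abstract does not state the kernel, the number of components, or how the RBF NN is trained, so the Gaussian kernel, the values of `gamma`, `n_components`, `n_centers` and `width`, the random choice of centers, the least-squares output weights, and the synthetic Poisson "term count" data are all assumptions made for illustration. The names `KernelPCA`, `RBFNetwork` and `run_demo` are hypothetical.

```python
import numpy as np


def gaussian_kernel(A, B, gamma):
    """Kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


class KernelPCA:
    """Kernel PCA with a Gaussian kernel (the kernel choice is an assumption)."""

    def __init__(self, n_components, gamma):
        self.n_components, self.gamma = n_components, gamma

    def _center(self, K):
        # Centring the kernel matrix is equivalent to centring in feature space.
        return K - self.col_mean_[None, :] - K.mean(axis=1)[:, None] + self.all_mean_

    def fit(self, X):
        self.X_ = X
        K = gaussian_kernel(X, X, self.gamma)
        self.col_mean_, self.all_mean_ = K.mean(axis=0), K.mean()
        vals, vecs = np.linalg.eigh(self._center(K))
        top = np.argsort(vals)[::-1][: self.n_components]
        # Scale eigenvectors so the feature-space principal axes have unit norm.
        self.alphas_ = vecs[:, top] / np.sqrt(np.maximum(vals[top], 1e-12))
        return self

    def transform(self, Z):
        # Project documents onto the principal components via kernel evaluations.
        return self._center(gaussian_kernel(Z, self.X_, self.gamma)) @ self.alphas_


class RBFNetwork:
    """RBF network: Gaussian hidden units plus a linear output layer.

    Centers are sampled from the training data and the output weights are
    solved by least squares; both are illustrative assumptions.
    """

    def __init__(self, n_centers, width, seed=0):
        self.n_centers, self.width, self.seed = n_centers, width, seed

    def _hidden(self, X):
        H = gaussian_kernel(X, self.centers_, 1.0 / (2.0 * self.width ** 2))
        return np.hstack([H, np.ones((X.shape[0], 1))])  # append a bias column

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        self.centers_ = X[rng.choice(len(X), self.n_centers, replace=False)]
        self.classes_ = np.unique(y)
        targets = (y[:, None] == self.classes_[None, :]).astype(float)  # one-hot
        self.W_, *_ = np.linalg.lstsq(self._hidden(X), targets, rcond=None)
        return self

    def predict(self, X):
        return self.classes_[np.argmax(self._hidden(X) @ self.W_, axis=1)]


def run_demo(seed=0):
    """Synthetic two-class 'term count' data: 120 documents, 50 terms."""
    rng = np.random.default_rng(seed)
    rates = np.r_[np.full(25, 5.0), np.full(25, 1.0)]
    X = np.vstack([rng.poisson(rates, size=(60, 50)),
                   rng.poisson(rates[::-1], size=(60, 50))]).astype(float)
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12  # L2-normalise rows
    y = np.r_[np.zeros(60, dtype=int), np.ones(60, dtype=int)]
    order = rng.permutation(120)
    X, y = X[order], y[order]
    X_train, y_train, X_test, y_test = X[:80], y[:80], X[80:], y[80:]

    kpca = KernelPCA(n_components=5, gamma=1.0).fit(X_train)
    Z_train, Z_test = kpca.transform(X_train), kpca.transform(X_test)  # 50 -> 5 dims
    net = RBFNetwork(n_centers=10, width=0.5).fit(Z_train, y_train)
    accuracy = float(np.mean(net.predict(Z_test) == y_test))
    return Z_train.shape, accuracy


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the classifier only ever sees the 5-dimensional KPCA projections, never the 50-dimensional input vectors, which mirrors the abstract's point that reducing dimensionality first is what makes the RBF NN practical on text data.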

Keywords/Search Tags:

Automated Text Categorization, Feature Selection, Feature Extraction, Principal Component Analysis, Kernel Principal Component Analysis, RBF Neural Network

Related items

1	Research On Feature Extraction Based On Principal Component Analysis
2	Research On Appearance-based Statistical Face Recognition
3	Research On Finger Vein Feature Extraction Algorithm
4	Research On Feature Extraction Technologies Of Complex Components Internal Structure
5	Research On Feature Transformation Based On Kernel Principal Component Analysis
6	Research On Feature Selection Algorithm Based On Kernel Sparse And Principal Component Analysis
7	Research On Face Recognition Algorithm Based On Principal Component Analysis
8	Principal Component Analysis And Its Application In Feature Extraction
9	Research On Kernel Projection Analysis Based Feature Extraction And Applications
10	Image Feature Extraction Technology Based On Incremental Two-Dimensional Principal Component Analysis