Automatic Chinese Webpages Classification Based On Projection Pursuit

Posted on: 2005-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Wan
GTID: 2168360122494248
Subject: Computer software and theory
Abstract/Summary:
How should the document information on the Internet be classified? This is an important open problem in document information processing and computer science research. Text is generally represented with the vector space model, which produces a feature space of very high dimensionality. Using these high-dimensional vectors directly for text classification runs into the curse of dimensionality, so dimensionality reduction is needed before classification. Most text classifiers reduce dimensionality by feature selection, which chooses a subset of the original feature set according to some criterion; this approach may discard relevant factors. The alternative is feature extraction, which derives a set of new features from the original ones through a functional mapping, as in Principal Component Analysis and Fisher Linear Discriminant Analysis. Those techniques assume normally distributed data. Text data does not satisfy this normality assumption, so a robust or nonparametric method is needed.

For these reasons, we propose a Chinese webpage classification algorithm based on Projection Pursuit. The procedure projects the data from the high-dimensional space onto a lower-dimensional subspace, searches for the projection direction that best reflects the structure and features of the high-dimensional data, and then projects the text onto that direction. The distribution of the projected data in the lower-dimensional subspace reveals the structure of the original high-dimensional space.

The main contributions of this thesis are:
(1) A genetic algorithm is used to search for the best projection direction without assuming normality, and each text vector is projected onto a one-dimensional space.
(2) Projection Pursuit is applied to text classification for the first time.
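The direction search described above can be illustrated with a minimal sketch. The abstract does not state which projection index or which genetic operators the thesis uses, so everything specific here is an assumption made for illustration: the index (spread of the projected values multiplied by their local density, in the style of Friedman and Tukey), the window radius of 0.1 times the spread, tournament selection, blend crossover, Gaussian mutation, and the function names `projection_index` and `search_direction`.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length so that only its direction matters."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        # Degenerate vector: fall back to the first coordinate axis.
        return [1.0] + [0.0] * (len(v) - 1)
    return [x / norm for x in v]


def project(docs, a):
    """Project every document vector onto direction a (one value per document)."""
    return [sum(x * w for x, w in zip(doc, a)) for doc in docs]


def projection_index(docs, a):
    """Assumed index: spread of the projections times their local density."""
    z = project(docs, a)
    n = len(z)
    mean = sum(z) / n
    spread = math.sqrt(sum((v - mean) ** 2 for v in z) / n)
    radius = 0.1 * spread  # assumed window radius
    density = 0.0
    for i in range(n):
        for j in range(n):
            d = abs(z[i] - z[j])
            if d < radius:
                density += radius - d
    return spread * density


def search_direction(docs, pop_size=30, generations=60,
                     mutation_scale=0.2, seed=0):
    """Search for a unit direction with a high index using a simple GA."""
    rng = random.Random(seed)
    dim = len(docs[0])
    population = [normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(projection_index(docs, a), a) for a in population]
        best = max(scored, key=lambda t: t[0])[1]
        children = [best]  # elitism: the best direction always survives
        while len(children) < pop_size:
            # Tournament selection: best of three random candidates.
            p1 = max(rng.sample(scored, 3), key=lambda t: t[0])[1]
            p2 = max(rng.sample(scored, 3), key=lambda t: t[0])[1]
            alpha = rng.random()  # blend crossover weight
            child = [alpha * x + (1.0 - alpha) * y for x, y in zip(p1, p2)]
            child = [x + rng.gauss(0.0, mutation_scale) for x in child]
            children.append(normalize(child))
        population = children
    return max(population, key=lambda a: projection_index(docs, a))
```

Calling `search_direction(docs)` on a list of equal-length term-weight vectors returns a unit vector, and `project(docs, a)` then gives the one-dimensional representation of each document. The pairwise density term costs quadratic time in the number of documents, so this sketch is only suitable for small collections.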
After the text vectors are projected onto the one-dimensional space, we classify the test set with the KNN algorithm. The experimental results show that recall and precision are better than those of the other methods. We also ran experiments with the Similarity method and the naive Bayes method. These experiments show that Projection Pursuit not only achieves better recall and precision but is also more stable.
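The classification step on the projected values can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the abstract does not give the value of k, the distance measure, or the tie-breaking rule, so the sketch assumes absolute difference as the distance, majority voting, and ties resolved in favour of the label seen first among the nearest neighbours. The function name `knn_1d` and the category labels in the usage note are hypothetical.

```python
from collections import Counter


def knn_1d(train_z, train_labels, query_z, k=3):
    """Label a projected document by majority vote of its k nearest
    training projections on the one-dimensional axis."""
    if len(train_z) != len(train_labels):
        raise ValueError("train_z and train_labels must have the same length")
    if not 1 <= k <= len(train_z):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training points by absolute distance to the query (stable sort).
    neighbours = sorted(zip(train_z, train_labels),
                        key=lambda pair: abs(pair[0] - query_z))[:k]
    votes = Counter(label for _, label in neighbours)
    # most_common keeps first-seen order among equal counts, so ties go
    # to the label that appears first among the nearest neighbours.
    return votes.most_common(1)[0][0]
```

For example, with training projections `[0.1, 0.2, 0.25, 0.9, 1.0, 1.1]` labelled `"sports"` for the first three and `"finance"` for the last three, a query at `0.15` with `k=3` is assigned `"sports"` because all three of its nearest neighbours carry that label.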
Keywords/Search Tags:Projection Pursuit, Chinese Webpages Classification, Text Classification, Dimension Reduction, Genetic Algorithm