
The Research On Text Categorization Algorithm Based On Support Vector Machine

Posted on: 2008-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: Q M Wu
GTID: 2178360215479891
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet, the amount of information online is increasing exponentially, and handling this large volume of online documents has become an important research problem. As one of the crucial parts of information retrieval, text classification has become an important research direction. The support vector machine (SVM), a machine learning method based on statistical learning theory, can resolve such practical problems as nonlinearity, high dimensionality and local minima. This thesis focuses on the drawbacks of SVM in practical applications, including text categorization.

The thesis first introduces the general development of text categorization and some of its techniques. It then introduces statistical learning theory and SVM, laying the theoretical foundation for the research in the following chapters. We set up an experimental platform and tested several common text categorization algorithms; the results show that SVM is particularly well suited to text categorization.

Since SVM is very sensitive to noise in the training set, this thesis proposes a support vector machine algorithm based on repeated training. Samples that still affect the decision surface after repeated training are selected, and they are then trained a further number of times according to their fuzzy membership. This changes the weights of these samples and reduces the influence of noise. When the improved SVM algorithm is applied to text categorization, the training time increases, but the results are better than those of the traditional support vector machine, and the method effectively distinguishes valid samples from noise.

The thesis also presents a second improved support vector machine to enhance classification performance. In the proposed algorithm the class center is calculated, and the samples close to the class center are selected, renamed and added to the training set to strengthen their weight, so that the representative samples are emphasized. The expanded training set is then used to train the support vector machine. When this improved algorithm is applied to Chinese text categorization, the training time increases, but the performance is better than that of the traditional support vector machine.
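The class-center idea described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say how the class center or closeness is measured, so the sketch assumes the center is the mean feature vector of a class and that closeness is Euclidean distance. The function names and the `per_class` and `copies` parameters are hypothetical, and "renaming" a sample is modeled simply as appending a duplicate.

```python
from collections import defaultdict


def class_centers(samples, labels):
    """Return {label: centroid}, assuming the center is the per-class mean vector."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    return {
        y: [sum(col) / len(rows) for col in zip(*rows)]
        for y, rows in grouped.items()
    }


def squared_distance(a, b):
    """Squared Euclidean distance; sufficient for ranking samples by closeness."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def expand_training_set(samples, labels, per_class=1, copies=1):
    """Append extra copies of the samples closest to their own class center.

    per_class (how many samples to pick per class) and copies (how many
    duplicates to add) are hypothetical knobs, not values from the thesis.
    """
    centers = class_centers(samples, labels)
    expanded_x, expanded_y = list(samples), list(labels)
    for y, center in centers.items():
        members = [x for x, lab in zip(samples, labels) if lab == y]
        # Rank the class members by distance to the class center.
        members.sort(key=lambda x: squared_distance(x, center))
        for x in members[:per_class]:
            for _ in range(copies):
                expanded_x.append(list(x))
                expanded_y.append(y)
    return expanded_x, expanded_y


# Toy two-class data set (made up for illustration only).
samples = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0],
           [10.0, 10.0], [10.0, 12.0], [10.0, 11.0]]
labels = ["a", "a", "a", "b", "b", "b"]
expanded_x, expanded_y = expand_training_set(samples, labels, per_class=1, copies=2)
```

The expanded lists would then be passed to any standard SVM trainer. Duplicating a sample increases its contribution to the training objective, which is one plausible reading of "strengthening its weight"; it also enlarges the training set, consistent with the longer training time the abstract reports.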
Keywords/Search Tags:Text categorization, Chinese web categorization, Support vector machine, Membership, Class center