The Research Of Web Page Classification Based On SVM

Posted on: 2005-09-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Wu
GTID: 2168360125950489
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, and especially the spread of the World Wide Web, the information resources available online now cover a wide range of fields. To address the resulting information overload, techniques for web mining and web information retrieval have developed considerably.

Classification is an important way to handle large-scale data, and automatic web page classification is an important research direction in information retrieval (IR). It makes it possible to build separate web page databases for each category, to improve the recall and precision of search engines, and to build automatically categorized information resources that offer users a classified catalogue.

Text classification is a specific pattern recognition problem. Machine learning methods from pattern recognition give better results on text than relevance feedback methods. Although text classification was for a time regarded as an information retrieval problem, it is now studied as a special case of pattern recognition. Many classical pattern recognition learning algorithms have already been applied to text classification, including nearest-neighbor classification, Bayesian decision methods, decision trees, neural networks, and support vector machines (SVM).

This thesis surveys the techniques used in automatic text classification. Two crucial techniques, feature selection and the classification learning algorithm, are tested experimentally. The main work covers the following aspects:

1. Text feature selection. Feature selection is an important step in text classification. The number of features collected from text is so large that general learning algorithms cannot be trained on them directly, which makes extracting a feature subset essential. Feature selection improves system performance in two respects. First, classification speed: feature selection greatly reduces the size of the feature set and the dimension of the text feature vectors, which speeds up the system. Second, accuracy: proper feature selection does not reduce accuracy and can even improve the system's precision. We compare several feature selection algorithms and adopt the CHI (chi-square) algorithm. We also analyze the structures in web pages that contribute to categorization and adjust term weights according to the different web page tags. We improve the TF-IDF formula to make it better suited to automatic classification.

2. Text classification algorithm. The emphasis is on SVM. Based on Statistical Learning Theory (SLT), the thesis discusses SVM in the linearly separable, linearly non-separable, and nonlinearly separable cases, and derives a convex quadratic programming (QP) problem with one equality constraint and a set of inequality constraints. A procedure for solving the QP problem is then proposed. For training SVM on larger text corpora, the decomposition method is important: it optimizes the SVM with respect to subsets and recursively solves the whole SVM. The ξα estimator, based on the leave-one-out test, estimates error rate, precision, recall, and F1 efficiently and effectively. The thesis demonstrates the effectiveness of the decomposition method and adopts five measures for reducing learning time.

Taking SVM as the foundation, we improve and extend it. We adopt a combined structure to realize N-class SVM classification, and we build an SVM-based classification system, "Clearcut", that combines hierarchy (level) information, link information, and membership in more than one class. The improvements and extensions fall mainly into three respects: (1) Individual classifiers are merged into a combined classifier. A large number of experiments have shown that combining the decisions of many classifiers can achieve better performance than an individual classifier, such as higher precision and l...
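
As an illustration of the CHI feature selection step named in item 1, the following is a minimal Python sketch of the standard chi-square statistic for a term and a category, followed by a top-k selection. The abstract does not give the thesis's exact formula or implementation, so this is only the textbook form. The function names (`chi_square`, `select_features`) and the toy documents are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for one (term, category) pair from a 2x2 table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, category, k):
    """Return the k terms with the highest chi-square score for `category`."""
    in_cat = sum(1 for y in labels if y == category)
    out_cat = len(labels) - in_cat
    df_in, df_out = Counter(), Counter()  # document frequencies per side
    for tokens, y in zip(docs, labels):
        target = df_in if y == category else df_out
        target.update(set(tokens))  # count each term once per document
    scores = {}
    for term in set(df_in) | set(df_out):
        n11, n10 = df_in[term], df_out[term]
        scores[term] = chi_square(n11, n10, in_cat - n11, out_cat - n10)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (hypothetical data, not taken from the thesis).
docs = [
    ["football", "match", "goal"],
    ["football", "league", "news"],
    ["stock", "market", "news"],
    ["stock", "bank", "market"],
]
labels = ["sports", "sports", "finance", "finance"]
top_terms = select_features(docs, labels, "sports", 2)
```

Note that the chi-square statistic is two-sided: a term that occurs only outside the category (here "stock" or "market") scores as high as one that occurs only inside it, because both are strongly informative about membership. A term spread evenly across categories, such as "news" in the toy corpus, scores zero.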
Keywords/Search Tags: Classification