Research And Realization Of Clustering Guided Web Chinese Text Classification Based On SVM

Posted on: 2005-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Zhang
Full Text: PDF
GTID: 2168360122967541
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet, the volume of network information is increasing rapidly. To make information services more efficient and precise, information on the Internet must be organized and classified sensibly. This thesis focuses on text information processing on the network and studies text clustering and classification at two levels: theory and application. First, a model of an automatic text classification system is described, which comprises five aspects: information preprocessing, feature representation, feature extraction, extraction of the classification model using text mining techniques (involving text clustering and classification), and evaluation of model quality. Second, the thesis introduces the theory and key techniques of word segmentation, feature extraction, text clustering, and text classification, and in particular the extraction of a clustering-guided classification model based on SVM. Finally, we construct a Chinese text classification machine, implement it in software, and test it on real data.

The central part of the thesis is the extraction of the clustering-guided classification model. Unlike a traditional classification machine, our research proceeds in a setting where class labels and class information are lacking: clustering replaces manual classification as the source of class information, and the result is good.

In the clustering part, we modify k-means to overcome its trend limitation, making its clustering result more balanced and better reflecting the character of the clusters. The modified algorithm can increase classification accuracy. We observe that the data are high-dimensional and sparse, and we propose the HSMBK and HSSCA algorithms to cope with this problem. (1) HSMBK uses the bisecting partition principle and adopts a new method of computing similarity, "binary feature sparse otherness". We apply the idea of selecting the best elements to the calculation of cluster centers in order to reduce the effect of isolated points. Finally, we propose the JW rule based on a heuristic idea. (2) HSSCA has two phases: first, it groups the data into small sub-clusters; second, it uses an agglomerative clustering algorithm to merge these small clusters until the required number of clusters is reached. It also adopts another new method of computing similarity, "binary feature sparse otherness based on collection". We validate the three clustering algorithms experimentally and select the best one, HSMBK, to extract the classification model.

In the classification part, we analyze in theory the advantages of applying the Support Vector Machine (SVM) to text classification. The two classical SVM algorithms, C-SVC and S-SVC, are studied further, and their performance is compared on real data. Finally, we present in detail the design of the Web Chinese text classification machine based on SVM.
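
The clustering-guided idea summarized above (cluster unlabeled documents first, then use the cluster ids as class labels for a classifier) can be illustrated with a small sketch. This is not the thesis's HSMBK algorithm: the abstract does not define "binary feature sparse otherness" or the JW rule, so the sketch assumes Jaccard distance over binary term sets as a stand-in dissimilarity and a majority-vote cluster center. All function names and the toy documents are hypothetical.

```python
from collections import Counter

def jaccard_distance(a, b):
    """Assumed stand-in dissimilarity: 0 for identical term sets, 1 for disjoint ones."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def majority_center(docs):
    """Assumed center: terms present in at least half of the cluster's documents."""
    counts = Counter(t for d in docs for t in d)
    return frozenset(t for t, c in counts.items() if 2 * c >= len(docs))

def bisect(docs, idx, iters=10):
    """Split the documents at positions idx into two groups with 2-means."""
    first = idx[0]
    # Deterministic seeding: take two documents that are far apart.
    seed_b = max(idx, key=lambda i: jaccard_distance(docs[first], docs[i]))
    seed_a = max(idx, key=lambda i: jaccard_distance(docs[seed_b], docs[i]))
    centers = [docs[seed_a], docs[seed_b]]
    left, right = [], []
    for _ in range(iters):
        new_left = [i for i in idx
                    if jaccard_distance(docs[i], centers[0])
                    <= jaccard_distance(docs[i], centers[1])]
        in_left = set(new_left)
        new_right = [i for i in idx if i not in in_left]
        if not new_left or not new_right:
            # Degenerate case (e.g. identical documents): split in half.
            half = len(idx) // 2
            return idx[:half], idx[half:]
        if new_left == left:
            break  # assignments stopped changing
        left, right = new_left, new_right
        centers = [majority_center([docs[i] for i in left]),
                   majority_center([docs[i] for i in right])]
    return left, right

def bisecting_clusters(docs, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(range(len(docs)))]
    while len(clusters) < k:
        pos = max(range(len(clusters)), key=lambda c: len(clusters[c]))
        if len(clusters[pos]) < 2:
            break  # nothing left to split
        left, right = bisect(docs, clusters[pos])
        clusters[pos] = left
        clusters.append(right)
    labels = [0] * len(docs)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy, already-segmented documents represented as binary term sets.
docs = [
    {"football", "match", "goal"},
    {"football", "goal", "team"},
    {"match", "team", "goal"},
    {"stock", "market", "bank"},
    {"stock", "bank", "loan"},
    {"market", "loan", "bank"},
]
labels = bisecting_clusters(docs, 2)
```

The resulting `labels` list plays the role of the manually assigned classes: each document and its cluster id would form one training example for an SVM trainer. Always splitting the largest cluster is one simple way to keep cluster sizes balanced, and it is a design choice of this sketch rather than something the abstract specifies.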
Keywords/Search Tags:Web Text mining, Text clustering, Text classification, Classification machine, Support Vector Machine