
Research Of Chinese Web Text Categorization Based On KNN Algorithm

Posted on: 2011-03-21 | Degree: Master | Type: Thesis
Country: China | Candidate: H Liu | Full Text: PDF
GTID: 2178330332962523 | Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of the Internet and the spread of computers, people can publish and obtain information on the World Wide Web more conveniently, and most of this information resides in Web pages. The number of Web pages grows by several million every day. How to find and extract valuable information and knowledge from this vast number of Web pages has become an urgent problem in information processing.

Web pages are semi-structured or unstructured text data, and text classification is an important technique for deriving useful information and knowledge from them. Unlike traditional text classification, Web text classification requires that HTML tags, script code, advertising code, and copyright information be removed from the pages first; otherwise they degrade the classification result. However, certain HTML tags indicate that the words they enclose are important. Based on this idea, this paper presents a new algorithm that calculates feature weights according to the location of the features in the page.

The KNN (K-Nearest Neighbor) algorithm is commonly used in text classification, and its classification efficiency is better than that of other text classification algorithms. However, KNN is easily affected by the value of K and by the distribution of the training texts: if K is chosen improperly, or if the training texts differ greatly across categories, classification performance becomes very unstable. To improve the classification accuracy and stability of KNN, this paper presents an improved KNN classification algorithm.

This paper introduces and analyzes in detail the commonly used feature weighting algorithms and text classification algorithms, and reports experiments with the proposed improved algorithms. The experimental results show the feasibility and effectiveness of the two improved algorithms.
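The two ideas summarized above, weighting a term by the HTML tag it appears in and classifying with KNN, can be illustrated with a minimal sketch. The abstract does not give the actual weighting formula or the improved KNN variant, so everything here is an assumption for illustration: the `TAG_WEIGHTS` values are hypothetical, the voting rule is a plain similarity-weighted KNN rather than the improved algorithm, and the input is assumed to be already word-segmented (English tokens stand in for segmented Chinese text).

```python
import math
from collections import Counter, defaultdict

# Hypothetical tag weights; the abstract does not state the actual values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5, "body": 1.0}


def weighted_vector(fields):
    """Build a term-weight vector from (tag, tokens) pairs.

    Each occurrence of a term adds the weight of the tag that encloses it,
    so a word in <title> counts more than the same word in body text.
    """
    vec = Counter()
    for tag, tokens in fields:
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for token in tokens:
            vec[token] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    """Plain similarity-weighted KNN vote (not the improved variant).

    training is a list of (vector, label) pairs.
    """
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return max(votes, key=votes.get)


# Toy usage with made-up documents.
training = [
    (weighted_vector([("title", ["football"]), ("body", ["match", "goal"])]), "sports"),
    (weighted_vector([("title", ["basketball"]), ("body", ["match", "score"])]), "sports"),
    (weighted_vector([("title", ["stock"]), ("body", ["market", "price"])]), "finance"),
]
query = weighted_vector([("title", ["football"]), ("body", ["goal", "score"])])
print(knn_classify(query, training, k=2))  # prints "sports"
```

Because the vote depends directly on `k` and on how many training texts each category contributes, this baseline shows the sensitivity that the abstract describes: changing `k` or skewing the category sizes can change the predicted label.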
Keywords/Search Tags: Chinese Web Text Categorization, Feature Selection, Weight Calculation, KNN