Research On Improved KNN Chinese Web Page Classification Based On Weka Platform

Posted on: 2019-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: C Huang
Full Text: PDF
GTID: 2438330548457586
Subject: Computer application technology
Abstract/Summary:
Web pages are among the most important media for transmitting information, and most Web information takes the form of text, which serves social interaction, entertainment, news, and knowledge sharing. The number of web pages is growing at a speed beyond people's imagination, which makes traditional manual classification unrealistic. This explosive growth also floods the Web with irrelevant noise pages, so it is increasingly difficult for people to find the information they need quickly and effectively. Organizing and managing web page information reasonably and effectively has therefore become an important research topic. Chinese web page classification serves this purpose: it applies text classification techniques to categorize web pages, so that users can retrieve pages in a targeted way and portal websites can classify their pages conveniently.

Based on research and analysis of the whole process of Chinese web page classification, this thesis selects KNN as the text classifier for web pages. The KNN algorithm is a simple and effective non-parametric classification method that is widely used in text classification experiments. To address the high dimensionality encountered in text classification, a DC-DF feature extraction method is proposed to reduce the number of feature items and the dimension of the text representation. Based on an analysis of the advantages and disadvantages of the KNN algorithm, and in view of the problem that each test text must be compared for similarity against a large number of training samples, a KNN algorithm based on grouping center vectors is proposed. The samples within each category are divided into groups, and a center vector is obtained for each group. Similarity is then calculated against the set of center vectors in place of the full training library, which improves the classification performance of the algorithm. Experiments show that the improved algorithm achieves higher precision, recall, and F-measure than the traditional KNN algorithm, and has some advantages over other classification algorithms.
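
The idea of comparing a test text against group center vectors instead of every training sample can be illustrated with a minimal sketch. The abstract does not say how samples are grouped, which similarity measure is used, or how the vote is taken, so the code below makes its own assumptions: samples in each category are split into consecutive fixed-size groups, similarity is cosine similarity, and the label is a majority vote over the k most similar centers. The function names (`cosine`, `group_centers`, `classify`) and the toy vectors are hypothetical and are not taken from the thesis or from Weka.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_centers(samples, labels, group_size):
    """Split each category's samples into consecutive groups (an assumed
    grouping rule) and return one (center_vector, label) pair per group."""
    by_label = defaultdict(list)
    for vec, label in zip(samples, labels):
        by_label[label].append(vec)

    centers = []
    for label, vecs in by_label.items():
        for start in range(0, len(vecs), group_size):
            group = vecs[start:start + group_size]
            # The center is the component-wise mean of the group's vectors.
            center = [sum(col) / len(group) for col in zip(*group)]
            centers.append((center, label))
    return centers


def classify(centers, query, k):
    """Majority vote over the k centers most similar to the query."""
    ranked = sorted(centers, key=lambda c: cosine(c[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy term-weight vectors for two hypothetical categories.
train = [
    [1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.8, 0.0, 0.2], [1.0, 0.2, 0.0],
    [0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.8, 0.2], [0.2, 1.0, 0.0],
]
labels = ["sports"] * 4 + ["finance"] * 4

centers = group_centers(train, labels, group_size=2)
prediction = classify(centers, [0.9, 0.1, 0.1], k=3)
print(len(train), "training samples reduced to", len(centers), "centers")
print("predicted category:", prediction)
```

In this sketch the query is compared with 4 center vectors rather than 8 training samples; with a larger `group_size` the number of similarity computations drops further, at the cost of coarser centers.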
Keywords/Search Tags: Chinese web page classification, KNN algorithm, Text classification, Feature extraction, Group center vector