Research On Improved KNN Chinese Web Page Classification Based On Weka Platform

Posted on:2019-04-22

Degree:Master

Type:Thesis

Country:China

Candidate:C Huang

GTID:2438330548457586

Subject:Computer application technology

Abstract/Summary:

Web pages are among the most important media for transmitting information, and most Web information takes the form of text, which serves important functions in social interaction, entertainment, news, and knowledge. The number of web pages is now growing faster than people can imagine, so traditional manual classification is unrealistic. Moreover, this explosive growth has flooded the Web with irrelevant noise pages, making it increasingly difficult for people to find the information they need quickly and effectively. Organizing and managing web page information reasonably and effectively has therefore become an important research topic. Chinese web page classification serves this purpose: it categorizes web pages using text classification techniques, so that users can retrieve pages in a targeted way and portal websites can classify their pages conveniently. Based on research into and analysis of the whole process of Chinese web page classification, this paper selects KNN as the text classifier for web pages. The KNN algorithm is a simple and effective non-parametric classification method that is widely used in text classification experiments. To address the high dimensionality encountered in text classification, a DC-DF feature extraction method is proposed to reduce the number of feature items and the dimension of the text. Based on an analysis of the advantages and disadvantages of the KNN algorithm, and in view of the problem that each test text must be compared for similarity with a large number of training samples, a KNN algorithm based on grouping center vectors is proposed. The samples within each category are divided into groups, and a center vector is computed for each group. The similarity computation is then performed against this set of center vectors rather than against the full training library, so as to improve the classification performance of the algorithm. Experiments show that the improved algorithm achieves higher precision, recall, and F-measure than the traditional KNN algorithm, and has some advantages over other classification algorithms.
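
The grouping-center-vector idea can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how samples are grouped, which similarity measure is used, or how votes are counted. The sketch therefore assumes fixed-size groups in input order, cosine similarity, and similarity-weighted voting, and the names `group_centers`, `knn_classify`, and `group_size` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_centers(samples, group_size):
    """Split each category's vectors into groups and average each group.

    samples: list of (vector, label) pairs.
    Grouping by fixed-size chunks is an assumption made for illustration.
    Returns a list of (center_vector, label) pairs.
    """
    by_label = {}
    for vec, label in samples:
        by_label.setdefault(label, []).append(vec)

    centers = []
    for label, vecs in by_label.items():
        for start in range(0, len(vecs), group_size):
            group = vecs[start:start + group_size]
            center = [sum(col) / len(group) for col in zip(*group)]
            centers.append((center, label))
    return centers


def knn_classify(query, centers, k):
    """Classify query by its k most similar center vectors.

    Each neighbour votes with its similarity as weight (an assumption).
    """
    scored = [(cosine(query, center), label) for center, label in centers]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Toy term-weight vectors; the data is invented for the example.
train = [
    ([1.0, 0.0, 0.1], "sports"), ([0.9, 0.1, 0.0], "sports"),
    ([0.8, 0.2, 0.1], "sports"), ([1.0, 0.1, 0.0], "sports"),
    ([0.0, 1.0, 0.2], "finance"), ([0.1, 0.9, 0.1], "finance"),
    ([0.0, 0.8, 0.3], "finance"), ([0.2, 1.0, 0.0], "finance"),
]
centers = group_centers(train, group_size=2)
prediction = knn_classify([0.9, 0.1, 0.05], centers, k=3)
```

With eight training vectors and `group_size=2`, the query is compared against four center vectors instead of eight samples. That reduction in similarity computations is the saving the abstract describes.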

Keywords/Search Tags:

Chinese web page classification, KNN algorithm, Text classification, Feature extraction, Group center vector

Related items

1	Research And Implementation Of Chinese Automatic Text Classification System Based On SVM
2	Term Weight-Based Chinese Text Classification Algorithm
3	The Design And Implementation Of The Chinese E-Mail Classification System Based On Text Classification Technology
4	Study On Some Chinese Text Classification Technology And Applications In Knowledge Extraction
5	Research And Implementation Of Automatic Classification System And Key Technologies On Chinese Web Page
6	Research And Implementation On A Web Page Classification System
7	Research Of Chinese Page Automatic Classification Based On Vector Space Model
8	Research On Feature Description And Classifier Construction Algorithm In Chinese Text Classification
9	Chinese Text Classification Algorithm
10	Research On Short Text Classification Of Chinese News Based On Machine Learning