The Research And Application Of Chinese Web Text Classification

Posted on:2009-06-29

Degree:Master

Type:Thesis

Country:China

Candidate:C Y Wu

GTID:2178360272456542

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of network and information technology, the number of Web pages on the Internet has grown exponentially. How to organize and process this vast amount of information effectively, and how to search, filter and manage these resources better, has become an urgent problem. Web text classification is a core topic in data mining and information retrieval, and text classification methods based on machine learning have achieved good results, but improving classification accuracy and classification speed remains an open problem.

This paper mainly studies Chinese Web text. Because of the special nature of Chinese text, it first analyses Chinese word segmentation methods and then presents a new rough segmentation model based on bi-grams and the N-most-probable method. The model aims to produce, efficiently, a small set of rough segmentation results with high recall that covers the correct segmentation and unknown words as far as possible, so that the quality of the subsequent segmentation steps can be improved. Next, because Chinese Web content is information-rich and updated quickly, a new Web representation method based on new-word discovery is presented, in which a Web document is represented using both dictionary words and newly discovered words. The experimental results show that this method helps to identify unknown words and extend the current dictionary, strengthens the representation of Web documents, improves the quality of the resulting vectors, and improves Web document classification.

As a simple, effective and nonparametric classification method, KNN is widely used in Web text classification. However, the method has large computational demands, because it must compute the similarity between the unlabeled text and every training text, and its classification precision may also decrease because of the commonality among classes.
In this paper, an improved KNN method is presented that addresses both of these problems. The method first uses the Rocchio method to quickly select the K0 most likely classes, then applies the KNN algorithm to a set of representative training texts from those K0 classes, and finally assigns the class using an improved similarity measure within KNN. The experimental results indicate that the new method performs better. In addition, because Web resources are commonly organized in a hierarchical structure, this paper also discusses hierarchical classification and proposes a Web text classification method that combines the hierarchical structure with KNN. In this algorithm, the hierarchical structure is used to speed up classification, and the KNN algorithm compensates for the lower precision of hierarchical classification. Experiments on both methods show that the two improved KNN classification algorithms improve classification efficiency to a large extent and also improve classification accuracy to some extent.
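
The two-stage idea described above (Rocchio pre-selection of K0 classes, followed by KNN restricted to those classes) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: documents are assumed to be sparse term-weight dictionaries, the Rocchio prototype is assumed to be a plain class mean, all training texts of the candidate classes are used instead of a selected representative subset, and ordinary cosine similarity stands in for the improved similarity measure, whose definition is not given in the abstract. The names `cosine`, `centroids`, `classify`, `k0` and `k` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroids(training):
    """Rocchio-style prototypes (assumption: the mean vector of each class)."""
    sums, counts = defaultdict(Counter), Counter()
    for vec, label in training:
        sums[label].update(vec)  # adds the term weights of vec to the class sum
        counts[label] += 1
    return {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}


def classify(doc, training, k0=2, k=3):
    """Two-stage classification: centroid pre-selection, then KNN."""
    protos = centroids(training)
    # Stage 1: keep only the k0 classes whose centroid is most similar to doc.
    candidates = sorted(protos, key=lambda c: cosine(doc, protos[c]),
                        reverse=True)[:k0]
    # Stage 2: similarities are computed only against the candidate classes'
    # training texts (assumption: all of them, not a representative subset).
    pool = [(cosine(doc, vec), label)
            for vec, label in training if label in candidates]
    neighbours = sorted(pool, key=lambda p: p[0], reverse=True)[:k]
    # Similarity-weighted vote among the k nearest neighbours.
    votes = defaultdict(float)
    for sim, label in neighbours:
        votes[label] += sim
    return max(votes, key=votes.get)


# Toy training set (made up for illustration only).
training = [
    ({"goal": 2, "match": 1}, "sports"),
    ({"team": 1, "goal": 1}, "sports"),
    ({"stock": 2, "market": 1}, "finance"),
    ({"bank": 1, "market": 2}, "finance"),
    ({"film": 2, "actor": 1}, "culture"),
]
print(classify({"goal": 1, "team": 1}, training))
```

In this sketch the second stage compares the query only with training texts from the K0 candidate classes, which is where the saving over plain KNN would come from; in practice the class prototypes would be computed once rather than on every call.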

Keywords/Search Tags:

Chinese word segmentation, feature selection, Web text representation, Web text classification, KNN algorithm, hierarchical structure

Related items

1	Text Representation And Algorithms For Chinese Text Classification
2	Research On Classification Method On Chinese Short Texts With Few Words Based On Feature Representation
3	Researches On Hierarchical Chinese Text Classification
4	Research On Network Text Classification Technique
5	Research On Improvement Of Chi-square Feature Selection And Word Vector Text Representation For News Classification
6	Key Technologies Research And Implementation Of Chinese Text Automatic Classification
7	Research And Improvement Of Automatic Classification Technology For Chinese Text
8	Research On Core Technology Of The Chinese Text Classification
9	Automatic Classification Research On Chinese Web Document Orientation
10	Research On Short Text Classification Of Chinese News Based On Machine Learning