The Research And Application Of Chinese Web Text Classification

Posted on: 2009-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Wu
GTID: 2178360272456542
Subject: Computer software and theory

Abstract/Summary:
With the rapid development of network and information technology, the number of Web pages on the Internet has grown exponentially. How to organize and process this vast amount of information effectively, and how to search, filter, and manage these resources better, has become an urgent problem. Web text classification is a core component of data mining and information retrieval, and text classification methods based on machine learning have achieved good results, but improving classification accuracy and classification speed remains a problem.

This thesis mainly studies Chinese Web text. Because of the special nature of Chinese text, it first analyzes Chinese word segmentation methods and then presents a new rough segmentation model based on bi-grams and the N-most-probable method. The model aims to produce a small set of rough segmentation results with high recall and high efficiency, covering the correct segmentation and unknown words as far as possible, so that the quality of the subsequent segmentation steps can be improved. Lastly, because Chinese Web content is information-rich and updated quickly, a new Web representation method based on new-word discovery is presented, in which a Web document is represented using both words and new words. The experimental results show that the method helps identify unknown words and extend the current dictionary, strengthens the representation of Web documents, improves the quality of the resulting feature vectors, and improves Web document classification.

As a simple, effective, and nonparametric classification method, KNN is widely used in Web text classification. However, the method has two drawbacks: it is computationally expensive, because it must compute the similarity between the unlabeled text and every training text, and its classification precision may suffer because of commonality among classes.
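The computational cost of plain KNN can be seen in a minimal sketch. The code below is an illustration only, not the thesis's implementation: it assumes documents are already represented as sparse term-weight dictionaries, uses cosine similarity with similarity-weighted voting, and the names `cosine` and `knn_classify` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training, k=3):
    """Predict a label for `query` from `training`, a list of (vector, label).

    Every call computes one similarity per training text, which is the
    cost the abstract refers to.
    """
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim  # similarity-weighted vote (an assumption)
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
toy_training = [
    ({"football": 1.0, "match": 1.0}, "sports"),
    ({"goal": 1.0, "match": 1.0}, "sports"),
    ({"stock": 1.0, "market": 1.0}, "finance"),
    ({"bank": 1.0, "market": 1.0}, "finance"),
]
print(knn_classify({"match": 1.0, "goal": 1.0, "team": 1.0}, toy_training))
```

With N training texts, each query requires N similarity computations, so the cost grows linearly with the size of the training set.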
This thesis presents an improved KNN method that addresses the two problems mentioned above. The method first uses the Rocchio method to quickly select the K0 most likely classes, then applies KNN to a set of representative training texts from those K0 classes, and finally assigns the class using an improved similarity measure within KNN. The experimental results indicate that the new method performs better. In addition, because Web resources are commonly organized in a hierarchical structure, the thesis also discusses hierarchical classification and proposes a Web text classification method that combines hierarchical structure with KNN. In this algorithm, the hierarchical structure is used to speed up classification, and KNN compensates for the lower precision of hierarchical classification. Experiments on both methods show that the two improved KNN classification algorithms greatly improve classification efficiency and also improve classification accuracy to some extent.
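The two-stage idea of Rocchio pre-selection followed by KNN can be sketched as follows. This is a rough illustration under stated assumptions, not the thesis's algorithm. The abstract does not say how representative training texts are chosen or what the improved similarity measure is, so the sketch uses all training texts of the candidate classes and plain cosine similarity. It also uses class centroids (a positive-only form of Rocchio) as class prototypes. The names `centroids` and `rocchio_knn` and the parameter `k0` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroids(training):
    """Build one prototype per class: the mean of that class's vectors."""
    sums = defaultdict(Counter)
    counts = Counter()
    for vec, label in training:
        counts[label] += 1
        for term, weight in vec.items():
            sums[label][term] += weight
    return {
        label: {term: w / counts[label] for term, w in total.items()}
        for label, total in sums.items()
    }


def rocchio_knn(query, training, k0=2, k=3):
    """Stage 1: keep the k0 classes whose centroids are closest to `query`.
    Stage 2: run KNN only over training texts of those classes."""
    protos = centroids(training)  # in practice, computed once and cached
    ranked = sorted(protos, key=lambda c: cosine(query, protos[c]), reverse=True)
    candidates = set(ranked[:k0])

    scored = sorted(
        ((cosine(query, vec), label)
         for vec, label in training if label in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim  # similarity-weighted vote (an assumption)
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
toy_training = [
    ({"football": 1.0, "match": 1.0}, "sports"),
    ({"goal": 1.0, "match": 1.0}, "sports"),
    ({"stock": 1.0, "market": 1.0}, "finance"),
    ({"bank": 1.0, "market": 1.0}, "finance"),
    ({"chip": 1.0, "software": 1.0}, "tech"),
    ({"software": 1.0, "network": 1.0}, "tech"),
]
print(rocchio_knn({"match": 1.0, "goal": 1.0}, toy_training, k0=2, k=3))
```

The saving comes from stage 1: comparing the query against one prototype per class is cheap, and stage 2 then computes similarities only for the training texts of the K0 surviving classes rather than for the whole training set.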
Keywords/Search Tags: Chinese word segmentation, feature selection, Web text representation, Web text classification, KNN algorithm, hierarchical structure