Research On Classification Of Chinese Documents Based On Vector Space Model

Posted on: 2007-12-11
Degree: Master
Type: Thesis
Country: China
Candidate: P L Liu
GTID: 2178360182479276
Subject: Computer application technology
Abstract/Summary:
With the development of the WWW, the number of documents on the Internet is increasing exponentially. One important research focus is how to handle this large volume of documents. Automatic text classification is a crucial part of information management. As more and more people use the Internet, a wide variety of websites contain large amounts of Chinese information, the bulk of which exists as text. Because of the great differences between Chinese and English, foreign research results in text classification cannot be applied entirely to Chinese text classification. Research on the classification of Chinese text therefore has significant practical value.

This paper focuses mainly on text classification techniques. We studied text models and commonly used text classification algorithms, and discussed the application of these algorithms to Chinese text classification. We propose a new dictionary (word table) structure that supports fast word segmentation, and a new feature selection algorithm. Text classification algorithms are supervised, which means that training the classifier requires human-labeled data from a fixed set of classes. Generally, classifier accuracy is higher with more labeled data, but hand-labeled data is an expensive resource. One vital problem in text classification is how to reduce the amount of labeled data while maintaining adequate accuracy. This paper proposes a novel algorithm, called iterative TFIDF, which combines a large amount of unlabeled data with a small amount of labeled data to train the TFIDF classifier. Iterative TFIDF is a hill-climbing algorithm, so it shares the common problem of converging to a local optimum. To deal with this problem, we introduce active learning to slow convergence toward a local optimum. The results show that this addition is helpful, and on the same experimental data the algorithm achieves higher accuracy than kNN, Bayes, and standard TFIDF.
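The abstract does not spell out the iterative TFIDF procedure, so the following is only a minimal sketch of the general idea it describes: a TF-IDF centroid classifier is trained on a few labeled documents, used to pseudo-label the unlabeled ones, and retrained until the labels stop changing. Everything here is an assumption made for illustration and not the thesis's actual method: the function names (`tfidf_vectors`, `iterative_tfidf`, and the helpers), the smoothed IDF formula, the centroid-plus-cosine classifier, and the stopping rule. Documents are assumed to be already segmented into tokens, and the active learning step is left out.

```python
import math
from collections import Counter, defaultdict


def normalise(vec):
    """Scale a sparse vector (dict) to unit L2 length; empty vectors pass through."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF dict per tokenised document.

    Uses a smoothed IDF, log((1 + n) / (1 + df)) + 1, chosen for this sketch.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        vectors.append(normalise(vec))
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroids(vectors, labels):
    """Sum the vectors of each class and normalise: one prototype per class."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for t, w in vec.items():
            sums[label][t] += w
    return {label: normalise(dict(s)) for label, s in sums.items()}


def classify(vec, cents):
    """Pick the class whose centroid is most similar (ties broken alphabetically)."""
    return max(sorted(cents), key=lambda c: cosine(vec, cents[c]))


def iterative_tfidf(labeled_docs, labels, unlabeled_docs, max_iter=10):
    """Self-training loop: pseudo-label, retrain, repeat until labels are stable."""
    vectors = tfidf_vectors(list(labeled_docs) + list(unlabeled_docs))
    lab_vecs = vectors[:len(labeled_docs)]
    unl_vecs = vectors[len(labeled_docs):]
    cents = centroids(lab_vecs, labels)  # start from labeled data only
    guesses = None
    for _ in range(max_iter):
        new_guesses = [classify(v, cents) for v in unl_vecs]
        if new_guesses == guesses:  # no label changed: a (possibly local) optimum
            break
        guesses = new_guesses
        cents = centroids(lab_vecs + unl_vecs, list(labels) + guesses)
    return cents, guesses


# Toy, made-up data: two labeled documents and three unlabeled ones.
labeled = [["football", "match", "goal"], ["stock", "market", "price"]]
labels = ["sports", "finance"]
unlabeled = [["goal", "team", "match"], ["price", "bank", "stock"],
             ["team", "coach", "goal"]]
cents, guesses = iterative_tfidf(labeled, labels, unlabeled)
print(guesses)  # ['sports', 'finance', 'sports']
```

Because each round only reinforces the current pseudo-labels, a loop like this can settle on a poor labeling early. That is the local-optimum behaviour the abstract mentions; an active learning step, omitted here, would ask a human to label selected documents between rounds.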
Keywords/Search Tags: text classification, iterative TFIDF algorithm, active learning