Research On Classification Of Chinese Documents Based On Vector Space Model

Posted on:2007-12-11

Degree:Master

Type:Thesis

Country:China

Candidate:P L Liu

Full Text:PDF

GTID:2178360182479276

Subject:Computer application technology

Abstract/Summary:

With the development of the World Wide Web, the number of documents on the internet is increasing exponentially, and an important research question is how to handle such large volumes of documents. Automatic text classification is a crucial part of information management. As more and more people use the internet, a wide variety of websites now hold large amounts of Chinese-language information, most of which exists as text. Because Chinese and English differ greatly, results from text classification research abroad cannot be applied directly to Chinese text. Research on Chinese text classification therefore has considerable practical significance.

This thesis focuses on text classification techniques. We study the text models and classification algorithms in common use and discuss how these algorithms apply to Chinese text. We propose a new dictionary (word table) structure that supports fast word segmentation, and a new feature selection algorithm. Text classification algorithms are supervised: training the classifier requires human-labeled data for a fixed set of classes. In general, the classifier is more accurate when more labeled data is available, but hand-labeled data is an expensive resource. A vital problem in text classification is therefore how to reduce the amount of labeled data required while maintaining acceptable accuracy. This thesis proposes a novel algorithm, called iterative TFIDF, which combines a large amount of unlabeled data with a small amount of labeled data to train the TFIDF classifier. Iterative TFIDF is a hill-climbing algorithm and shares the common problem of converging to a local optimum. To address this, we introduce active learning to slow convergence to a local optimum. The results show that this combination is helpful: on the same experimental data, the algorithm achieves higher accuracy than kNN, Bayes, and standard TFIDF.
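The abstract does not spell out the iterative TFIDF procedure, so the following is only a rough sketch of one plausible reading: a nearest-centroid classifier over TF-IDF vectors that is retrained repeatedly on its own predictions for the unlabeled documents until those predictions stop changing. All function names, the smoothed IDF formula, the stopping rule, and the toy pre-segmented documents are assumptions made for illustration, not details taken from the thesis. The active learning step is not shown.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector (term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Turn pre-segmented documents (lists of words) into unit TF-IDF vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption; the thesis may weight terms differently).
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        vectors.append(normalize(vec))
    return vectors


def centroids(vectors, labels):
    """Sum the vectors of each class and normalize the result."""
    sums = {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, Counter())
        for t, w in vec.items():
            acc[t] += w
    return {label: normalize(acc) for label, acc in sums.items()}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def classify(vec, cents):
    """Return the class whose centroid is most similar to vec."""
    return max(sorted(cents), key=lambda c: cosine(vec, cents[c]))


def iterative_tfidf(labeled_docs, labels, unlabeled_docs, max_iter=20):
    """Self-training loop: guess labels, retrain on them, repeat until stable."""
    vectors = tfidf_vectors(labeled_docs + unlabeled_docs)
    lab_vecs = vectors[:len(labeled_docs)]
    unl_vecs = vectors[len(labeled_docs):]
    cents = centroids(lab_vecs, labels)  # start from labeled data only
    guesses = None
    for _ in range(max_iter):
        new_guesses = [classify(v, cents) for v in unl_vecs]
        if new_guesses == guesses:  # no label changed: a (local) optimum
            break
        guesses = new_guesses
        cents = centroids(lab_vecs + unl_vecs, list(labels) + guesses)
    return cents, guesses


# Toy example with hypothetical, already segmented Chinese documents.
labeled = [["股票", "市场", "上涨"], ["足球", "比赛", "进球"]]
labels = ["finance", "sports"]
unlabeled = [["股票", "基金", "上涨"], ["足球", "球队", "比赛"]]
cents, guesses = iterative_tfidf(labeled, labels, unlabeled)
print(guesses)
```

Because each round only reinforces the current guesses, an early mistake can become locked in, which is the local-optimum behaviour the abstract attributes to hill climbing. Under the same assumptions, an active learning step could ask a human to label the unlabeled documents whose two best centroid similarities are closest together.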

Keywords/Search Tags:

text classification, iterative TFIDF algorithm, active learning

Related items

1	Tfidf-based Text Classification Algorithm Research
2	Research On Chinese Text Classification Algorithm Based On Active Learning Approach
3	Research On Combining Collective Classification With Active Learning
4	Research On Text Classification Of Web Text Mining
5	Research On KNN Text Classification And Term Weighting Algorithm
6	The Design And Implement Of A Mongolian Text Classifier Based On Active Learning SVM
7	Design And Implementation Of Text Classification System Based On Active Learning
8	Research Of Text Classification Algorithm Based On Semi-supervised SVM Active Learning
9	Sentiment Classification By Combining Lexicon-based And Machine Learning Methods
10	Study On Chinese Text Classification Algorithm Based On Rough Set And Its Application