
The Research On Automated Text Categorization Algorithms

Posted on: 2006-09-02
Degree: Master
Type: Thesis
Country: China
Candidate: W B Zhu
GTID: 2178360185465379
Subject: Computer software and theory
Abstract/Summary:
With the development of the Internet, the number of documents available online is growing exponentially, and handling them automatically has become an important research direction. Text categorization organizes online text documents into categories, which supports more efficient search strategies for information retrieval and yields more accurate results. Automated text categorization algorithms are the topic of this thesis.

The thesis first reviews the general development of automated text categorization. It then compares, through experiments and analysis, the performance of several typical categorization algorithms: Naive Bayes (NB), TFIDF, k Nearest Neighbors (k-NN), and Support Vector Machine (SVM). These results lay the theoretical and experimental groundwork for the following chapters.

Although smoothing lets the NB algorithm avoid zero-probability estimates, it has some disadvantages. The thesis proposes two new strategies, NBTF and NBTS, which remove zero-probability estimates from the NB algorithm without smoothing. In experiments and analysis, the new algorithms perform well in effectiveness and applicability compared with Laplace and SGT smoothing.

The thesis also examines whether adjusting the weights of training documents can improve the performance of a single classifier. Two new algorithms, KTrain1 and KTrain2, are proposed; they use simpler weight-adjusting strategies than the one adopted by the AdaBoost algorithm. In experiments and analysis, they outperform their base learning algorithms, NB and TFIDF.

By analyzing the TFIDF and k-NN algorithms and incorporating the idea that increasing the weights of incorrectly classified training documents can improve a classifier's performance, the thesis proposes S-TFIDF, an improved version of the TFIDF algorithm that adopts ideas from k-NN. Experiments verify that S-TFIDF outperforms both TFIDF and k-NN. Moreover, S-TFIDF is as efficient as TFIDF, which suggests it is suitable for large-scale text categorization tasks.
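
To make the zero-probability problem mentioned in the abstract concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with optional additive (Laplace) smoothing. It illustrates only the baseline smoothing that the thesis compares against; it is not NBTF or NBTS, whose details the abstract does not give. The function names (`train_nb`, `log_posterior`), the toy documents, and the choice to skip out-of-vocabulary terms are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(docs, alpha):
    """Estimate multinomial NB parameters.

    docs  -- list of (tokens, label) pairs (hypothetical toy format)
    alpha -- additive smoothing constant; 0 means no smoothing, 1 is Laplace
    """
    vocab = {t for tokens, _ in docs for t in tokens}
    labels = {label for _, label in docs}
    doc_counts = Counter(label for _, label in docs)
    term_counts = {label: Counter() for label in labels}
    for tokens, label in docs:
        term_counts[label].update(tokens)

    priors = {c: doc_counts[c] / len(docs) for c in labels}
    cond = {}
    for c in labels:
        total = sum(term_counts[c].values())
        denom = total + alpha * len(vocab)
        # P(t | c) with additive smoothing over the training vocabulary
        cond[c] = {t: (term_counts[c][t] + alpha) / denom for t in vocab}
    return priors, cond


def log_posterior(tokens, priors, cond):
    """Return the unnormalized log posterior of each class for a document."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for t in tokens:
            if t not in cond[c]:
                continue  # assumption: ignore terms never seen in training
            p = cond[c][t]
            # A single zero estimate drives the whole class score to -inf.
            score += math.log(p) if p > 0 else float("-inf")
        scores[c] = score
    return scores


# Toy training set (invented for this example).
train_docs = [
    (["goal", "match", "team"], "sport"),
    (["team", "win", "goal"], "sport"),
    (["stock", "market", "price"], "finance"),
    (["market", "price", "rise"], "finance"),
]
test_doc = ["team", "win", "market"]

unsmoothed = log_posterior(test_doc, *train_nb(train_docs, alpha=0.0))
smoothed = log_posterior(test_doc, *train_nb(train_docs, alpha=1.0))
predicted = max(smoothed, key=smoothed.get)
```

Without smoothing, every class contains at least one test term it never saw in training, so all class scores collapse to negative infinity and the classifier cannot choose. With `alpha=1.0` the scores stay finite and the document can be ranked against each class.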
Keywords/Search Tags: Automated text categorization, Zero-probability estimate, Smoothing, Training documents weight-adjusting, Combine TFIDF and k-NN