The Research On Automated Text Categorization Algorithms

Posted on:2006-09-02

Degree:Master

Type:Thesis

Country:China

Candidate:W B Zhu

Full Text:PDF

GTID:2178360185465379

Subject:Computer software and theory

Abstract/Summary:

With the development of the Internet, the number of documents available online is growing exponentially, and processing them automatically has become an important research direction. Text categorization organizes documents on the Web into categories, which gives information retrieval more efficient search strategies and more accurate results. This thesis studies automated text categorization algorithms.

The thesis first reviews the general development of automated text categorization. It then compares several typical categorization algorithms experimentally and analytically, namely Naive Bayes (NB), TFIDF, k Nearest Neighbors (k-NN) and Support Vector Machine (SVM), laying the theoretical and experimental groundwork for the following chapters.

Although smoothing lets the NB algorithm avoid zero-probability estimates, it has some disadvantages. The thesis proposes two new strategies, NBTF and NBTS, which remove zero-probability estimates from the NB algorithm without smoothing. In the experiments and analyses, the new algorithms perform well in effectiveness and applicability compared with Laplace and SGT smoothing.

The thesis also examines whether adjusting the weights of training documents can improve the performance of a single classifier. It proposes two new algorithms, KTrain1 and KTrain2, which use simpler weight-adjusting strategies than the one adopted by the AdaBoost algorithm. In the experiments and analyses, the new algorithms outperform their base learning algorithms, NB and TFIDF.

By analyzing the TFIDF and k-NN algorithms, and building on the idea that increasing the weights of incorrectly classified training documents can improve a classifier's performance, the thesis proposes S-TFIDF, an improved version of the TFIDF algorithm that adopts the idea behind k-NN. Experiments verify that S-TFIDF outperforms both TFIDF and k-NN. Moreover, S-TFIDF is as efficient as TFIDF, which suggests it is suitable for large-scale text categorization tasks.
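
To illustrate the zero-probability problem that the abstract refers to, the following is a minimal sketch of a multinomial NB classifier with optional Laplace (add-alpha) smoothing. It is not the thesis's NBTF or NBTS strategy, whose details are not given in this abstract. The function names, the toy training set and the `alpha` parameter are all assumptions made for this example.

```python
import math
from collections import Counter

def train(docs):
    """Collect per-class word counts, class document counts and the vocabulary.

    docs is a list of (label, tokens) pairs (a hypothetical toy format).
    """
    word_counts = {}
    class_counts = Counter()
    vocab = set()
    for label, tokens in docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def word_prob(word, label, word_counts, vocab, alpha):
    """Estimate P(word | label); alpha=0 means no smoothing, alpha=1 is Laplace."""
    counts = word_counts[label]
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * len(vocab))

def classify(tokens, word_counts, class_counts, vocab, alpha=1.0):
    """Return the most probable label, or None if every class scores zero."""
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / n_docs)  # log prior
        for word in tokens:
            if word not in vocab:
                continue  # ignore words never seen in training
            p = word_prob(word, label, word_counts, vocab, alpha)
            if p == 0.0:
                score = -math.inf  # one unseen word wipes out the whole class
                break
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration only.
training_docs = [
    ("sports", ["match", "goal", "team"]),
    ("sports", ["team", "win", "goal"]),
    ("tech", ["code", "bug", "team"]),
]
word_counts, class_counts, vocab = train(training_docs)
test_doc = ["goal", "team", "code"]

unsmoothed = classify(test_doc, word_counts, class_counts, vocab, alpha=0.0)
smoothed = classify(test_doc, word_counts, class_counts, vocab, alpha=1.0)
print(unsmoothed, smoothed)
```

In this toy setting, every class has a zero estimate for at least one word of the test document when `alpha=0`, so no class can be scored and `classify` returns `None`. With `alpha=1`, every estimate is positive and the document is assigned to `sports`.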

Keywords/Search Tags:

Automated text categorization, Zero-probability estimate, Smoothing, Training documents weight-adjusting, Combine TFIDF and k-NN

Related items

1	Tfidf-based Text Classification Algorithm Research
2	Design And Realization Of Automated Text Categorization System For Chinese Documents Based On Relevancy
3	The Research On A Term Weight Calculation Method Based On The Term Mathematical Expectation
4	A Study On Key Issues Of Automated Text Categorization For Chinese Documents
5	Research On Automated Text Categorization Based On RBF Network
6	The Research And Implementation Of Automatic Text Categorization For Chinese Web Documents
7	Application Of Improved TFIDF Algorithm In Text Analysis
8	Text Categorization On Machine Learning Algorithm
9	Design And Implementation Of Kazak Text Categorization System
10	Research On Chinese Text Categorization Algorithms Based On Technology Text