Research On Text Categorization And Technologies

Posted on: 2008-06-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Q Shang
GTID: 1118360242466082
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the World Wide Web, large numbers of documents have become available on the Internet, and automatic text categorization has become a key technology for organizing and processing large amounts of text data. Because most classifiers use the Vector Space Model (VSM), text preprocessing has become the bottleneck of categorization: its results directly affect categorization performance. This dissertation therefore investigates text categorization algorithms thoroughly, concentrates on text preprocessing algorithms, and effectively improves categorization performance.

The contributions of this dissertation are as follows:

(1) Improved text preprocessing algorithms

This dissertation proposes a novel text feature selection algorithm. Before categorization, the words that best represent the text must be selected as the dimensions of the feature space, in order to reduce the dimensionality of that space and improve the performance of the classifier. Based on an analysis of existing text preprocessing algorithms, we improve the Gini index algorithm, which has previously been used for attribute selection in decision trees, and use the improved algorithm to select text features. This algorithm effectively improves categorization performance.

This dissertation also proposes a novel feature weighting algorithm. In text preprocessing based on the VSM, the selected feature words must then be weighted, in order to emphasize the words that matter for categorization and suppress subordinate or noise words. The classical feature weighting algorithm is the TF-IDF method.
Based on an analysis of the merits and demerits of this method, we apply an improved Gini index algorithm to it, which effectively improves the categorization performance of the classifier.

(2) Improved kNN text classifier

This dissertation improves the decision rule of the kNN classifier. Of the many text categorization algorithms proposed in recent years, kNN has been studied by many researchers and shown to be among the best-performing methods. This dissertation concentrates on the decision rule of the kNN algorithm: it adopts fuzzy set theory and constructs a new membership function based on document similarities, which effectively improves categorization performance, especially when the class distribution is unbalanced.

Building on the fuzzy kNN text classifier, this dissertation uses the improved Gini index to weight features, which improves the categorization performance of fuzzy kNN.

(3) Improved Naive Bayes text classifier

The Naive Bayes classifier is one of the best text categorization algorithms. This dissertation uses the improved Gini index algorithm to design a new categorization decision rule for Naive Bayes, which effectively improves its categorization performance.

(4) Proposed a novel text categorization model

Among the many text categorization algorithms that have been proposed, SVM (Support Vector Machine), kNN, and Naive Bayes have been shown to perform comparatively well. Based on a study of these algorithms, this dissertation puts forward a novel text categorization algorithm based on an improved Gini index. The method combines the merits of the three algorithms above while overcoming their demerits, and greatly improves categorization performance. This dissertation provides a theoretical proof of the algorithm's feasibility and demonstrates its validity experimentally.
This method is a promising text categorization algorithm.

All the algorithms presented in this dissertation are verified by experimental results.
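
To make the idea of Gini-index feature selection in contribution (1) concrete, here is a minimal Python sketch. It is not the dissertation's improved algorithm, which the abstract does not specify. It assumes the textbook purity form, scoring each term by the sum over classes of P(class | term)^2, with the probabilities estimated from document frequencies. The function names `gini_scores` and `select_features` are hypothetical.

```python
from collections import Counter, defaultdict


def gini_scores(docs, labels):
    """Score each term by sum over classes of P(class | term) ** 2.

    docs   : list of token lists, one per document
    labels : class label of each document, in the same order
    Assumption: P(class | term) is estimated as the fraction of the
    documents containing the term that belong to the class.
    """
    # term -> Counter mapping class -> number of documents containing the term
    df_by_class = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df_by_class[term][label] += 1

    scores = {}
    for term, counts in df_by_class.items():
        total = sum(counts.values())
        scores[term] = sum((n / total) ** 2 for n in counts.values())
    return scores


def select_features(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = gini_scores(docs, labels)
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]
```

A term that occurs in only one class scores 1.0, and a term spread evenly over m classes scores 1/m, so keeping the top k terms keeps the most class-specific words. One limitation of this plain form is that a term occurring in a single document also scores 1.0, however rare it is.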
Keywords/Search Tags: text categorization, text preprocessing, feature selection, feature weight, Gini index