
Improvement and Application of Term Weighting for Text Classification

Posted on: 2008-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: G Li
Full Text: PDF
GTID: 2178360242971566
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of networks and information technology, people can access more and more knowledge. However, when a specific piece of knowledge is needed, it is difficult to find it quickly in the vast sea of information. Search engines solve this problem only to some degree: they simply match a few keywords and return an enormous number of results, which does not help users reach the specific information they want. Automatic document classification is an effective remedy and has become a valuable technology. In recent years, many statistical theories and machine learning methods have been applied to automatic document classification, making it a hot research topic.

One of the main difficulties in automatic document classification is the high dimensionality of the feature space and the sparseness of the text representation vectors. To reduce the dimensionality of the feature space and improve the efficiency and precision of classification, the first problem is to find an effective algorithm for computing term weights. In the study of Chinese text classification, this thesis focuses on improving the term-weighting algorithm and completes the following work:

① The traditional term-weighting algorithm is analyzed and found to have three limitations: 1) it does not take into account the distribution of feature terms among categories; 2) it does not take into account the distribution of feature terms within a category; 3) it does not take into account the partial-classification behavior of feature terms. From the viewpoints of term frequency, concentration, and distribution, this thesis proposes the term-weighting algorithm TF-IDF-DI-WFDB (an illustrative sketch of such a weighting follows this abstract).

② A measure describing the inter-category and intra-category distribution information of feature terms is introduced, based on their inter-category and intra-category distribution degrees; this yields the improved weighting algorithm TF-IDF-DI. Because the traditional weighting algorithm also ignores the partial-classification of feature terms, a word-frequency-differentia-based (WFDB) factor is introduced to make up for this shortcoming, yielding the improved weighting algorithm of this thesis: TF-IDF-DI-WFDB.

③ To verify that the improved weighting algorithm TF-IDF-DI-WFDB outperforms the traditional one, a first experiment classifies documents with the KNN algorithm. Judged by the overall confusion matrix, the overall recall and precision, and the per-class recall and precision, the results show that classification with the improved weighting algorithm is better than with the traditional algorithm.

④ Building on the improved weighting algorithm TF-IDF-DI-WFDB, a genetic algorithm is used to train the classifier. The results show that the classification performance of the genetic algorithm matches that of the KNN classifier and is, to some extent, better. This demonstrates that the improved term-weighting algorithm proposed in this thesis is correct and practical.
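The abstract does not give the exact formulas for the DI (distribution information) and WFDB (word frequency differentia based) factors, so the following is only a minimal Python sketch of the general idea: a classic TF-IDF weight adjusted by an assumed inter-category concentration factor and an assumed intra-category coverage factor. The function name `category_aware_tfidf` and both factor definitions are hypothetical illustrations, not the thesis's actual algorithm.

```python
# Illustrative sketch of a category-aware TF-IDF weighting in the spirit of
# TF-IDF-DI-WFDB. The exact DI and WFDB formulas are not given in the abstract,
# so the two distribution factors below are assumptions for illustration only.
import math
from collections import Counter, defaultdict

def category_aware_tfidf(docs, labels):
    """docs: list of token lists; labels: parallel list of category labels.
    Returns a dict mapping (doc_index, term) -> weight."""
    n_docs = len(docs)
    categories = set(labels)

    # Document frequency of each term (classic IDF part).
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    # Per-category document frequency and category sizes (distribution factors).
    cat_df = defaultdict(Counter)   # category -> term -> #docs in category containing term
    cat_size = Counter(labels)      # category -> #docs
    for doc, label in zip(docs, labels):
        cat_df[label].update(set(doc))

    weights = {}
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])

            # Assumed inter-category factor: terms concentrated in few
            # categories get a larger weight (1 - normalized entropy).
            probs = [cat_df[c][term] / df[term] for c in categories if cat_df[c][term] > 0]
            entropy = -sum(p * math.log(p) for p in probs)
            max_entropy = math.log(len(categories)) if len(categories) > 1 else 1.0
            inter = 1.0 - entropy / max_entropy if len(categories) > 1 else 1.0

            # Assumed intra-category factor: terms spread evenly over the
            # documents of the doc's own category get a larger weight.
            label = labels[i]
            intra = cat_df[label][term] / cat_size[label]

            weights[(i, term)] = (count / len(doc)) * idf * (1.0 + inter) * (1.0 + intra)
    return weights

# Toy usage: two categories, three short documents.
docs = [["machine", "learning", "text"], ["text", "classification"], ["football", "match"]]
labels = ["tech", "tech", "sport"]
weights = category_aware_tfidf(docs, labels)
```

In the thesis, weights of this kind are used to build the document feature vectors that are then classified with KNN and, in the final experiment, with a classifier trained by a genetic algorithm.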
Keywords/Search Tags: Text representation, Feature vector, Vector space model, TF-IDF, Genetic algorithm