
Improvement and Application of Term Weighting for Text Classification

Posted on: 2008-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: G Li
Full Text: PDF
GTID: 2178360242971566
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of networks and information technology, people can access more and more knowledge. However, when a specific piece of knowledge is needed, it is difficult to find it quickly in the vast sea of information. Search engines solve this problem only to some degree: they simply match a few keywords and return an enormous number of results, which does not help users reach the specific information they want. Automatic document classification is an effective remedy and has become a valuable technology. In recent years, many statistical theories and machine learning methods have been applied to automatic document classification, making it a hot research topic.

One of the main difficulties in automatic document classification is the high dimensionality of the feature space and the sparseness of the text representation vectors. To reduce the dimensionality of the feature space and improve the efficiency and precision of classification, the first problem is to find an effective algorithm for computing term weights. In the study of Chinese text classification, this thesis focuses on improving the term-weighting algorithm and completes the following work:

① The traditional term-weighting algorithm is analyzed and found to have three limitations: 1) it does not take into account the distribution of feature terms among categories; 2) it does not take into account the distribution of feature terms within a category; 3) it does not take into account the partial-classification behavior of feature terms. From the viewpoints of term frequency, concentration, and distribution, this thesis proposes the term-weighting algorithm TF-IDF-DI-WFDB (an illustrative sketch of such a weighting follows this abstract).

② A measure describing the inter-category and intra-category distribution information of feature terms is introduced, based on their inter-category and intra-category distribution degrees; this yields the improved weighting algorithm TF-IDF-DI. Because the traditional weighting algorithm also ignores the partial-classification of feature terms, a word-frequency-differentia-based (WFDB) factor is introduced to make up for this shortcoming, yielding the improved weighting algorithm of this thesis: TF-IDF-DI-WFDB.

③ To verify that the improved weighting algorithm TF-IDF-DI-WFDB outperforms the traditional one, a first experiment classifies documents with the KNN algorithm. Judged by the overall confusion matrix, the overall recall and precision, and the per-class recall and precision, the results show that classification with the improved weighting algorithm is better than with the traditional algorithm.

④ Building on the improved weighting algorithm TF-IDF-DI-WFDB, a genetic algorithm is used to train the classifier. The results show that the classification performance of the genetic algorithm matches that of the KNN classifier and is, to some extent, better. This demonstrates that the improved term-weighting algorithm proposed in this thesis is correct and practical.
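The abstract does not give the exact formulas for the DI (distribution information) and WFDB (word frequency differentia based) factors, so the following is only a minimal Python sketch of the general idea: a classic TF-IDF weight adjusted by an assumed inter-category concentration factor and an assumed intra-category coverage factor. The function name `category_aware_tfidf` and both factor definitions are hypothetical illustrations, not the thesis's actual algorithm.

```python
# Illustrative sketch of a category-aware TF-IDF weighting in the spirit of
# TF-IDF-DI-WFDB. The exact DI and WFDB formulas are not given in the abstract,
# so the two distribution factors below are assumptions for illustration only.
import math
from collections import Counter, defaultdict

def category_aware_tfidf(docs, labels):
    """docs: list of token lists; labels: parallel list of category labels.
    Returns a dict mapping (doc_index, term) -> weight."""
    n_docs = len(docs)
    categories = set(labels)

    # Document frequency of each term (classic IDF part).
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    # Per-category document frequency and category sizes (distribution factors).
    cat_df = defaultdict(Counter)   # category -> term -> #docs in category containing term
    cat_size = Counter(labels)      # category -> #docs
    for doc, label in zip(docs, labels):
        cat_df[label].update(set(doc))

    weights = {}
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])

            # Assumed inter-category factor: terms concentrated in few
            # categories get a larger weight (1 - normalized entropy).
            probs = [cat_df[c][term] / df[term] for c in categories if cat_df[c][term] > 0]
            entropy = -sum(p * math.log(p) for p in probs)
            max_entropy = math.log(len(categories)) if len(categories) > 1 else 1.0
            inter = 1.0 - entropy / max_entropy if len(categories) > 1 else 1.0

            # Assumed intra-category factor: terms spread evenly over the
            # documents of the doc's own category get a larger weight.
            label = labels[i]
            intra = cat_df[label][term] / cat_size[label]

            weights[(i, term)] = (count / len(doc)) * idf * (1.0 + inter) * (1.0 + intra)
    return weights

# Toy usage: two categories, three short documents.
docs = [["machine", "learning", "text"], ["text", "classification"], ["football", "match"]]
labels = ["tech", "tech", "sport"]
weights = category_aware_tfidf(docs, labels)
```

In the thesis, weights of this kind are used to build the document feature vectors that are then classified with KNN and, in the final experiment, with a classifier trained by a genetic algorithm.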
Keywords/Search Tags: Text representation, Feature vector, Vector space model, TF-IDF, Genetic algorithm