
Research On Text Classification Based On Feature Selection And Feature Weighting Algorithm

Posted on: 2016-05-26 | Degree: Master | Type: Thesis
Country: China | Candidate: H Shi | Full Text: PDF
GTID: 2208330470450253 | Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet as a new medium of information dissemination, people can not only obtain information from all over the world with ease but also publish their own information to it. As a result, the volume of information resources available on the Internet is growing explosively. At the same time, the spread of tablet computers, smartphones, and other Internet terminals, together with the emergence of social networks such as Renren, Weibo, and WeChat, as well as various recruitment and dating websites, has further accelerated the growth of Internet data. Faced with such an enormous amount of information, how to manage it reasonably and effectively so that people can obtain what they need more conveniently has become a hot research topic. Text classification, a core technique of text mining, can address this problem effectively.

Text classification is a complex process. Based on a detailed understanding of each of its stages, this thesis focuses on feature dimension reduction and feature weighting. After text preprocessing, a text is represented as a high-dimensional, sparse feature vector. This not only increases the time and space complexity of classification but also significantly harms classification accuracy. Feature dimension reduction can alleviate this problem; it comprises feature extraction and feature selection. Feature selection algorithms are the more widely used of the two in text classification systems because of their simplicity and good dimension-reduction performance. This thesis first briefly introduces several common feature selection algorithms and then concentrates on information gain, which previous research has shown to perform well. Based on a detailed analysis of the frequency of features within a category, their distribution within a category, and their distribution across categories, an improved information gain algorithm, IGimp, is proposed to remedy the insufficient consideration of term frequency in the traditional information gain feature selection algorithm.

Since the discriminative ability of each feature differs, its weight should reflect that ability, and different feature weighting schemes strongly affect the structure of the text vector space. This thesis therefore first introduces several common feature weighting algorithms together with their advantages and disadvantages, and then focuses on the shortcomings of the TF-IDF weighting algorithm. Improvements are first made to address the weaknesses of the IDF factor; as a further refinement, information distribution entropy parameters within and across categories are introduced based on the concept of entropy.

To verify the effectiveness of the improved information gain feature selection algorithm and the improved TF-IDF feature weighting algorithm proposed in this thesis, two comparative experiments were carried out on a Chinese text classification platform. The first experiment compares the improved IGimp algorithm with four other common feature selection algorithms, and the second compares the improved TF-NIDFimp algorithm with the traditional TF-IDF algorithm. Both experiments use precision, recall, and the F1 measure to evaluate the improved algorithms. The experimental results show that the IGimp and TF-NIDFimp algorithms proposed in this thesis outperform the traditional algorithms, demonstrating their effectiveness.
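A minimal sketch follows, assuming a hypothetical toy corpus: the abstract does not give the exact formulas for the improved IGimp or TF-NIDFimp algorithms, so the code below only illustrates the standard baselines they build on, namely information gain feature selection and TF-IDF weighting. All function names and data are illustrative assumptions, not the thesis's implementation.

# Sketch of the baseline techniques: information gain and TF-IDF.
import math
from collections import Counter


def entropy(label_counts):
    """Shannon entropy H(C) of a class distribution given as a Counter."""
    total = sum(label_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in label_counts.values() if c)


def information_gain(docs, labels, term):
    """Standard IG(t) = H(C) - [P(t)H(C|t) + P(not t)H(C|not t)]."""
    n = len(docs)
    h_c = entropy(Counter(labels))
    with_t = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    without_t = Counter(lab for doc, lab in zip(docs, labels) if term not in doc)
    p_t = sum(with_t.values()) / n
    return h_c - (p_t * entropy(with_t) + (1 - p_t) * entropy(without_t))


def tf_idf(docs):
    """Standard TF-IDF weights: tf(t, d) * log(N / df(t)) for each document."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]


# Hypothetical two-class corpus of tokenised documents.
docs = [["stock", "market", "rise"], ["team", "goal", "match"],
        ["market", "trade", "stock"], ["coach", "team", "goal"]]
labels = ["finance", "sport", "finance", "sport"]

print(information_gain(docs, labels, "market"))  # 1.0: "market" perfectly separates the classes
print(tf_idf(docs)[0])                           # TF-IDF weights for the first document

The thesis's IGimp variant additionally weights terms by their within-category frequency and distribution, and TF-NIDFimp adds within- and between-category distribution entropy factors; those refinements are not reproduced here.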
Keywords/Search Tags:text classification, feature dimension reduction, feature weighting, word frequency, information distribution entropy