
Feature Selection For Unbalanced Data And Emotional Dictionary Building

Posted on: 2015-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Wu
GTID: 2298330452453218
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of modern technology, the world has entered the era of Big Data. Valuable information is submerged in an ocean of data, so data mining has become a hot topic in artificial intelligence in recent years. Automatic text classification is one of the key technologies of information processing and has been widely studied and applied. However, because information is expanding rapidly and new words keep emerging, the feature dimension becomes too high, causing a "curse of dimensionality". To process information more effectively, a large number of redundant and noisy features should be removed to reduce the feature dimension. Feature selection, as an effective dimensionality reduction method, is gaining more and more attention. This thesis focuses on two aspects of feature selection: designing a feature selection method suited to imbalanced datasets, and extending feature selection techniques to the calculation of emotional word weights in an emotion dictionary for sentiment classification.

To address the problem of unbalanced datasets, this thesis proposes a feature selection method based on a category-weighted strategy and a variance statistics strategy. First, larger weights are assigned to rare categories, so that the features characterizing rare categories are strengthened and performance on rare categories improves. Then, a variance statistics method for feature selection is presented. Finally, based on these two strategies, a new feature selection algorithm that combines Information Gain (IG) and the χ² statistic (CHI) is developed. Experiments on the Reuters-21578 corpus and the Fudan corpus (both unbalanced datasets) show that the new algorithm achieves better Micro F1 and Macro F1 than IG, CHI and DFICF.

In text sentiment analysis, building an emotion dictionary is very important. However, existing research mainly stops at polarity discrimination of simple expressions. The weight assignment of emotional words has rarely been studied, and the existing methods require benchmark words to be selected. To solve this problem, we propose an approach based on feature selection that calculates the weights of emotional words automatically. First, we state assumptions relating the emotional weight of words to the emotional tendency of texts. We then improve IG and CHI for sentiment classification and use the improved measures to calculate the weights of emotional words. Experimental results show that using the emotion dictionary with the calculated weights for text sentiment classification improves classification accuracy, so the proposed algorithm not only realizes automatic weight calculation but is also reasonable and effective.
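The category-weighted idea described for unbalanced datasets can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the inverse-category-size weights, the weighted-sum aggregation, and the function names `chi_square` and `category_weighted_chi` are all assumptions made for illustration; this is not the thesis's algorithm, and it omits the variance statistics strategy and the combination with IG.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square statistic for a 2x2 term/category table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category without the term
    d: documents outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def category_weighted_chi(docs: List[Tuple[Set[str], str]]) -> Dict[str, float]:
    """Score each term by a category-weighted sum of chi-square values.

    Assumption (not from the thesis): a category's weight is proportional
    to the inverse of its document count, so rare categories count more.
    """
    n = len(docs)
    cat_sizes = Counter(label for _, label in docs)

    # Hypothetical weighting: inverse category frequency, normalized to sum to 1.
    raw = {cat: n / size for cat, size in cat_sizes.items()}
    total = sum(raw.values())
    weights = {cat: w / total for cat, w in raw.items()}

    term_cat: Counter = Counter()  # (term, category) -> document count
    term_df: Counter = Counter()   # term -> document count
    for terms, label in docs:
        for term in terms:
            term_cat[(term, label)] += 1
            term_df[term] += 1

    scores: Dict[str, float] = {}
    for term, df in term_df.items():
        score = 0.0
        for cat, size in cat_sizes.items():
            a = term_cat[(term, cat)]
            b = df - a
            c = size - a
            d = n - size - b
            score += weights[cat] * chi_square(a, b, c, d)
        scores[term] = score
    return scores


# Toy corpus: four documents in a common category, one in a rare category.
docs = [
    ({"oil", "price"}, "common"),
    ({"oil", "market"}, "common"),
    ({"price", "market"}, "common"),
    ({"oil"}, "common"),
    ({"wheat"}, "rare"),
]
scores = category_weighted_chi(docs)
top_terms = sorted(scores, key=scores.get, reverse=True)
```

With only two categories the chi-square value of a term is the same for both, so the weights have no effect on this toy corpus; they start to matter once there are three or more categories of unequal size, which is the setting the abstract targets.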
Keywords/Search Tags: text classification, feature selection, unbalanced datasets, construction of emotion dictionary, weight calculation