
Research On Text Feature Selection Algorithm And Its Application In Micro-Blog

Posted on: 2017-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Ren
GTID: 2348330491950956
Subject: Statistics
Abstract/Summary:
With the growing popularity of the Internet, the amount of information online increases year by year, and most of it is stored as text. Much of this data is complex, which makes it difficult to extract useful information. Text mining arose in response, and text categorization is one of its core techniques: it helps organize complicated data so that people can process the information effectively. Classification accuracy depends not only on the classification algorithm but also on the feature selection method, and the effective presentation of mining results is also worth studying. The main task of feature selection is to pick out, from the text data to be classified, the key terms that are most representative. Doing so also removes noise words, which reduces the dimensionality of the text and improves classification accuracy. Existing feature selection methods give too much weight to low-frequency words, which harms classification performance. This thesis therefore proposes an improvement to the chi-square feature selection algorithm. First, the chi-square formula is simplified: only positive correlation between a feature and a category is considered, and the chi-square value is set to zero by default when the correlation is negative, which reduces computation time. Second, a term based on the frequency of a feature word within a category is introduced, with alpha as an adjustable parameter, so that alpha can be used to counter the chi-square method's excessive emphasis on low-frequency words.

With the rapid growth in its number of users, micro-blogging has become an important public medium. It is not only an important tool for Internet users to obtain real-time news and reports, broaden their horizons and make friends, but also an important platform for public opinion. At present, micro-blog has more than 500 million registered users, which shows that the platform plays an increasingly important role in people's lives. However, micro-blogs produce a large volume of short text every day, and extracting useful information from this text efficiently is an important topic for statisticians in the new media age. Building on feature selection, this thesis takes the posts published on the micro-blog platform by students of Anqing Teachers College as an example. It applies the improved feature selection method together with a combined weighting formula, uses R and related statistical analysis, and improves how high-frequency words are displayed in the word cloud, so that the subject words are finally presented in a more intuitive word cloud form.
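The chi-square modification described in the abstract can be illustrated with a minimal Python sketch. The standard chi-square statistic for a term and a category is computed from a 2x2 table of document counts, and the score is set to zero when the term is negatively correlated with the category. The abstract does not give the formula through which alpha enters, so the `frequency_adjusted_score` function, its `tf_in_category` argument, and the power-law form used here are assumptions made purely for illustration, not the thesis's actual method.

```python
def chi_square_positive(a, b, c, d):
    """Chi-square score of a term for one category, positive correlation only.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    diff = a * d - b * c
    # Negative (or no) correlation: the score defaults to zero.
    if diff <= 0:
        return 0.0
    return n * diff * diff / denom


def frequency_adjusted_score(a, b, c, d, tf_in_category, alpha=0.5):
    """Hypothetical frequency adjustment (an assumption, not from the thesis).

    Scales the positive-only chi-square score by tf_in_category ** alpha,
    so terms that occur rarely in the category are ranked lower.
    With alpha = 0 the plain positive-only chi-square score is returned.
    """
    return chi_square_positive(a, b, c, d) * (tf_in_category ** alpha)
```

In this sketch, a larger alpha penalizes low-frequency terms more strongly, while alpha = 0 leaves the ranking identical to the positive-only chi-square score.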
Keywords/Search Tags: text categorization, feature selection, low frequency words, TF-IDF, weibo, word cloud