
Research Of LDA Short Text Classification Algorithm Based On Hadoop Platform

Posted on: 2017-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: M Zhang
Full Text: PDF
GTID: 2348330515481430
Subject: Management Science and Engineering
Abstract/Summary:
In recent years, the rapid development of network applications such as instant messaging and microblogs has produced a huge and rapidly growing volume of short text. How to exploit this massive data and extract valuable information from it has become a hot research topic. Short text research is widely applied in many fields, including network public opinion analysis, hot topic discovery, social network expansion, recommendation on shopping platforms, and information security. Short texts are characterized by short content, sparse features, and heavy noise, so traditional text classification methods perform poorly on them. Building on previous studies, this thesis proposes an LDA topic-based short text classification method that uses co-occurrence relationships. First, the Latent Dirichlet Allocation (LDA) topic model is applied to the short texts to obtain the topic-word distribution. Next, the words that appear in multiple topics simultaneously are extracted to build a co-occurrence word set. Then mutual information (MI) is used to calculate the relevance between each word in the co-occurrence word set and each topic, and the words that have similar relevance to at least two topics are selected to build a confusion word set. During classification, the weights of the words in the confusion word set are reduced to weaken their influence on the classification result. To improve the efficiency of the proposed method, it is combined with the Hadoop platform, since Hadoop's distributed architecture has a clear advantage in processing massive data. The experiments use two corpora, a news headline corpus and a microblog corpus, under two experimental schemes. First, the news headline corpus, which has fewer samples than the microblog corpus, is used to verify the feasibility of the algorithm and to compare it with traditional methods, checking whether it significantly improves the classification results. Second, the method is combined with the Hadoop platform to classify the microblog corpus and test whether efficiency improves significantly. The analysis of the experimental results shows that the proposed LDA short text classification method based on co-occurrence relationships achieves the expected significant improvement in classification results, and that combining it with the Hadoop platform dramatically improves classification efficiency.
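
The confusion word set step described in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas or thresholds, so everything below is an assumption rather than the thesis's implementation: the function names, the `tolerance` and `factor` parameters, the use of smoothed pointwise mutual information estimated from documents labelled with a dominant topic, and the toy data are all hypothetical. The top words per topic are assumed to come from an already trained LDA model.

```python
import math
from collections import defaultdict


def cooccurrence_words(topic_top_words):
    """Map each word found in the top words of >= 2 topics to those topics."""
    seen = defaultdict(set)
    for topic, words in topic_top_words.items():
        for word in words:
            seen[word].add(topic)
    return {w: topics for w, topics in seen.items() if len(topics) >= 2}


def pointwise_mi(word, topic, docs):
    """Smoothed pointwise MI between a word and a topic.

    docs is a list of (token_set, topic_label) pairs. One is added to each
    count so that the ratio is never zero or undefined (an assumed choice).
    """
    n = len(docs)
    n_word = sum(1 for tokens, _ in docs if word in tokens)
    n_topic = sum(1 for _, label in docs if label == topic)
    n_both = sum(1 for tokens, label in docs if label == topic and word in tokens)
    return math.log((n_both + 1) * n / ((n_word + 1) * (n_topic + 1)))


def confusion_words(topic_top_words, docs, tolerance=0.2):
    """Co-occurrence words whose MI is similar for at least two topics."""
    confused = set()
    for word, topics in cooccurrence_words(topic_top_words).items():
        scores = sorted(pointwise_mi(word, t, docs) for t in topics)
        # After sorting, the closest pair of scores is always adjacent.
        if any(b - a <= tolerance for a, b in zip(scores, scores[1:])):
            confused.add(word)
    return confused


def down_weight(weights, confused, factor=0.5):
    """Scale the weights of confusion words by factor; leave others unchanged."""
    return {w: v * factor if w in confused else v for w, v in weights.items()}


# Hypothetical toy data: top words per topic and topic-labelled short texts.
topic_top_words = {
    "sports": ["match", "team", "win"],
    "finance": ["stock", "market", "win", "team"],
}
docs = [
    ({"team", "win"}, "sports"),
    ({"match", "team"}, "sports"),
    ({"stock", "win"}, "finance"),
    ({"market", "stock"}, "finance"),
]

confused = confusion_words(topic_top_words, docs)
weights = down_weight({"win": 1.0, "team": 1.0, "stock": 1.0}, confused)
```

In this toy data both "win" and "team" co-occur across topics, but only "win" is equally associated with both, so only its weight is reduced. A production version would take the topic-word distribution from a real LDA library, and the counting in `pointwise_mi` is the kind of step that could be distributed as a MapReduce job on Hadoop.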
Keywords/Search Tags: Short-Text Classification, Co-occurrence, LDA, Hadoop, Internet public opinion