Research Of LDA Short Text Classification Algorithm Based On Hadoop Platform

Posted on:2017-02-03

Degree:Master

Type:Thesis

Country:China

Candidate:M Zhang

GTID:2348330515481430

Subject:Management Science and Engineering

Abstract/Summary:

In recent years, the rapid development of network applications such as instant messaging and microblogging has produced a huge and fast-growing volume of short text. How to make use of this massive data and extract valuable information from it has become a hot research topic. Short text research is widely applied in fields such as network public opinion analysis, hot topic discovery, social network expansion, shopping platform recommendation, and information security. Short texts are characterized by short content, sparse features, and heavy noise, so traditional text classification methods perform poorly on them. Building on previous studies, this thesis proposes an LDA topic-based short text classification method that uses co-occurrence relationships. First, the Latent Dirichlet Allocation (LDA) topic model is applied to the short texts to obtain the topic-word distribution. Next, words that appear in multiple topics simultaneously are extracted to build a co-occurrence word set. Mutual information (MI) is then used to calculate the relevance between each word in the co-occurrence word set and each topic, and words that have similar relevance to at least two topics are selected to build a confusion word set. During classification, the weights of words in the confusion word set are reduced to weaken their influence on the classification result. To improve the efficiency of the proposed method, it is combined with the Hadoop platform, since the Hadoop distributed system has a clear advantage in processing massive data. The experiments use two corpora, a news headlines corpus and a microblog corpus, under two schemes. First, the news headlines corpus, which has fewer samples than the microblog corpus, is used to verify the feasibility of the algorithm and to compare it with traditional methods to see whether it significantly improves classification results. Second, the method is run together with the Hadoop platform on the microblog corpus to test whether efficiency improves significantly. Analysis of the experimental results shows that the proposed co-occurrence-based LDA short text classification method achieves the expected significant improvement in classification results, and that combining it with the Hadoop platform dramatically improves classification efficiency.
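The confusion-word step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes the top words per topic have already been obtained from an LDA model, estimates pointwise mutual information from document counts, and treats two relevance scores as "similar" when they differ by at most a tolerance `tol`. The function names, the `tol` threshold, and the down-weighting `factor` are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def cooccurrence_words(topic_top_words):
    """Return words that appear in the top-word lists of at least two topics."""
    counts = defaultdict(int)
    for words in topic_top_words.values():
        for w in set(words):
            counts[w] += 1
    return {w for w, n in counts.items() if n >= 2}


def mutual_information(word, topic, docs):
    """Pointwise MI between a word and a topic label, from document counts.

    docs is a list of (set_of_words, topic_label) pairs.
    Returns -inf when the word and topic never occur together.
    """
    n = len(docs)
    n_w = sum(1 for words, _ in docs if word in words)
    n_t = sum(1 for _, t in docs if t == topic)
    n_wt = sum(1 for words, t in docs if t == topic and word in words)
    if n_wt == 0:
        return float("-inf")
    return math.log((n_wt * n) / (n_w * n_t))


def confusion_words(cooc, topics, docs, tol=0.1):
    """Select co-occurrence words whose MI is similar for at least two topics.

    "Similar" is assumed here to mean a difference of at most tol.
    """
    confused = set()
    for w in cooc:
        scores = sorted(
            s for s in (mutual_information(w, t, docs) for t in topics)
            if s != float("-inf")
        )
        # Some pair is within tol exactly when an adjacent sorted pair is.
        if any(b - a <= tol for a, b in zip(scores, scores[1:])):
            confused.add(w)
    return confused


def reweight(weights, confused, factor=0.5):
    """Scale down the weight of confusion words; leave other words unchanged."""
    return {w: (v * factor if w in confused else v) for w, v in weights.items()}
```

In this sketch, a word such as "win" that ranks highly in both a sports topic and a finance topic, and is about equally associated with both, would land in the confusion word set and have its feature weight reduced before classification.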

Keywords/Search Tags:

Short-Text Classification, Co-occurrence, LDA, Hadoop, Internet public opinion

Related items

1	Research Of Short Text Classification And Clustering In Public Opinion Analysis
2	Research On Text Classification System For The Internet Public Opinion Analysis
3	Research Of Key Technology On Internet Public Opinion Monitoring System
4	Internet Public Opinion Analysis For Short Text
5	Research And Development Of Network Public Opinion Text Classification System
6	Research And Implementation On Public Opinion Classification Of Microblogging Based On Hadoop
7	Design And Implementation Of Internet Public Opinion Monitoring And Processing Platform Based On Hadoop
8	Research On Key Technologies Of Network Public Opinion Information Identification And Analysis
9	Design And Implementation Of Internet Public Opinion Analysis System Based On Sina Weibo
10	Design And Realization Of An Internet Public Opinion Monitoring System