
Research On Chinese Microblog Topic Hierarchical Identification Method

Posted on: 2015-03-19 | Degree: Master | Type: Thesis
Country: China | Candidate: C Z Li | Full Text: PDF
GTID: 2268330428483770 | Subject: System theory
Abstract/Summary:
With the continuing development of Web 2.0, microblogs have become an important social networking platform on which people share their feelings and follow the latest events. People no longer face a shortage of information. On the contrary, because microblogs have a low barrier to entry and spread quickly, microblog data is growing explosively, and it is difficult to tell from this mixed data which topics, let alone which aspects of those topics, people have recently been discussing or following. This thesis therefore analyzes the characteristics and propagation of microblogs and, on that basis, discusses a hierarchical topic identification method for Chinese microblogs. First, the way microblogs propagate and the traditional methods of data acquisition are analyzed. On this basis, the thesis proposes a new data acquisition approach based on high-quality accounts: an initial set of seed users is selected by considering each account's number of followers, number of accounts followed, number of posts, and grade; the user account list is then expanded from these seeds; and finally the users' microblog texts are extracted. Second, methods for preprocessing and representing Chinese microblog data are discussed. Third, given that microblog topic identification is currently carried out only at a coarse-grained level, the thesis analyzes two typical topic identification methods: LDA (Latent Dirichlet Allocation) and SinglePass. After analyzing their characteristics and the limits of their applicability, it proposes a new hierarchical topic identification method for Chinese microblogs, called LSP, which combines the advantages of both. On the one hand, because the data volume is large and the features are sparse, LDA is used to identify the first layer of topics. On the other hand, the thesis improves the traditional SinglePass algorithm by incorporating comments and forwarding when identifying the sublayer topics. In addition, to cope with the sparse features of microblog text, a combined similarity algorithm based on both semantic similarity and statistical similarity is put forward. The algorithm uses HowNet as background knowledge to calculate semantic similarity and takes the relatedness between words into account, so that texts containing different synonyms or semantically related words can also be matched, which improves accuracy. Finally, the LSP method is verified on a Sina Microblog dataset. The experimental results show that the proposed hierarchical topic identification method can effectively express the hierarchy of topics.
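The sublayer step described above, single-pass clustering driven by a similarity that mixes a semantic score with a statistical one, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the linear weighting `alpha`, the `threshold` value, the use of cosine similarity over term counts as the statistical part, the best-match averaging used as the semantic part, and the toy `toy_word_sim` lookup that stands in for a HowNet-based word similarity. The first-layer LDA step and the use of comments and forwarding are not modeled, and English tokens stand in for segmented Chinese words.

```python
from collections import Counter
from math import sqrt

# Hypothetical stand-in for a HowNet-based word similarity (not the real HowNet API).
TOY_SYNONYMS = {frozenset(("earthquake", "quake"))}


def toy_word_sim(x, y):
    """Return 1.0 for identical words, 0.8 for listed synonyms, else 0.0."""
    if x == y:
        return 1.0
    return 0.8 if frozenset((x, y)) in TOY_SYNONYMS else 0.0


def cosine(a, b):
    """Statistical similarity: cosine of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic(a, b, word_sim):
    """Semantic similarity (assumed form): for each distinct word in a,
    take its best word-level match in b, then average. Not symmetric."""
    if not a or not b:
        return 0.0
    return sum(max(word_sim(x, y) for y in b) for x in a) / len(a)


def combined(a, b, word_sim, alpha):
    """Assumed linear mix of the semantic and statistical scores."""
    return alpha * semantic(a, b, word_sim) + (1 - alpha) * cosine(a, b)


def single_pass(docs, word_sim, threshold=0.6, alpha=0.5):
    """Single-pass clustering: each document, in arrival order, joins the
    most similar existing cluster if the score reaches the threshold,
    otherwise it starts a new cluster."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc indices]}
    for i, tokens in enumerate(docs):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for c in clusters:
            s = combined(vec, c["centroid"], word_sim, alpha)
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)  # accumulate term counts in the centroid
        else:
            clusters.append({"centroid": vec, "members": [i]})
    return [c["members"] for c in clusters]
```

In this toy setting, two short posts that share only one literal word but use the synonyms "earthquake" and "quake" fall into the same cluster when the semantic score is included (`alpha=0.5`), and into separate clusters when only the cosine score is used (`alpha=0.0`). This mirrors the motivation given in the abstract for combining the two similarities on sparse text.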
Keywords/Search Tags: Microblog, Topic Identification, LDA, SinglePass, Similarity