
Topic Detection And Trend Prediction Of Web Text

Posted on: 2014-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: L T Wang
GTID: 2268330392473590
Subject: Computer Science and Technology
Abstract/Summary:
With the development of Web 2.0 technology, Internet users are no longer only information seekers; they have also become information creators. With the emergence of social network services (SNS), more and more people take on the role of information creator and form self-media groups. Because social media is convenient and real-time, user-written texts are mostly short texts, and their number reaches into the billions. With this explosive growth of information, users no longer need to seek out masses of information; they need integrated information.

In response to this demand, this paper studies and implements a web text mining system. First, after the web text is preprocessed, texts on similar topics are aggregated into a topic cluster. Then, based on the proposed topic mining model, topic detection and trend prediction are performed. After text clustering analysis and topic mining, disordered texts are integrated into topic descriptions. The main work and innovations of this paper are as follows:

First, a method for calculating the semantic distance between short texts is studied and implemented. The method takes into account the effect of both the words and the word structure of a short text on its semantic representation, and treats semantic distance as a combination of word distance and structural distance. When calculating the structural distance, we found that the maximum matching between texts, based on HIT-CIR Tongyici Cilin (Extended edition), was the best indicator of how closely the texts' word sequences are arranged. We then compared word meanings between texts using a word similarity measure, an improved edit distance that gives each word a weight according to its type. Finally, we calculate the semantic distance between texts as a balance of structural distance and word distance.
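The combination of word distance and structural distance described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the type weights in `TYPE_WEIGHTS`, the balance parameter `alpha`, and all function names are hypothetical, and a longest-common-subsequence ratio stands in for the Cilin-based maximum matching, which is not specified in the abstract.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, word type), e.g. ("cat", "noun")

# Hypothetical per-type weights; the abstract does not give the actual values.
TYPE_WEIGHTS: Dict[str, float] = {"noun": 1.0, "verb": 0.8, "other": 0.3}


def weight(tok: Token) -> float:
    """Weight of a token according to its word type."""
    return TYPE_WEIGHTS.get(tok[1], TYPE_WEIGHTS["other"])


def weighted_edit_distance(a: List[Token], b: List[Token]) -> float:
    """Token-level edit distance where each operation costs the word's type weight."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + weight(a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + weight(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = a[i - 1][0] == b[j - 1][0]
            sub = 0.0 if same else max(weight(a[i - 1]), weight(b[j - 1]))
            d[i][j] = min(
                d[i - 1][j] + weight(a[i - 1]),  # delete a token of a
                d[i][j - 1] + weight(b[j - 1]),  # insert a token of b
                d[i - 1][j - 1] + sub,           # match or substitute
            )
    return d[m][n]


def word_distance(a: List[Token], b: List[Token]) -> float:
    """Weighted edit distance normalised to [0, 1] by the total token weight."""
    total = sum(weight(t) for t in a) + sum(weight(t) for t in b)
    return weighted_edit_distance(a, b) / total if total else 0.0


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def structural_distance(a: List[Token], b: List[Token]) -> float:
    """Stand-in structural distance: 1 minus the order-preserving match ratio."""
    wa, wb = [t[0] for t in a], [t[0] for t in b]
    longest = max(len(wa), len(wb))
    return 1.0 - lcs_length(wa, wb) / longest if longest else 0.0


def semantic_distance(a: List[Token], b: List[Token], alpha: float = 0.5) -> float:
    """Balance of word distance and structural distance (alpha is assumed)."""
    return alpha * word_distance(a, b) + (1.0 - alpha) * structural_distance(a, b)
```

Calling `semantic_distance` on two tokenised short texts returns a value in [0, 1], with 0 for identical texts. A synonym-aware matching such as the one built on Tongyici Cilin would replace the exact word comparison used in this sketch.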
Experimental results show that our method is effective and performs better than two classical distance calculation models.

Second, this paper presents a penalty algorithm for short texts based on the length of the text's content words. To eliminate the influence of sentence length, we use the distinct-word length to adjust the semantic distance above. Using Heaps' law and Zipf's law, a method for estimating the distinct-word length is presented.

Finally, a topic mining model is studied and implemented; topic mining includes topic detection and trend prediction. By analyzing the texts in a topic cluster, topic detection extracts keyword descriptions of the cluster. Trend prediction builds on the topic description and analyzes the trend of the topic to predict its development. There is a direct relationship between topic propagation and users' concern, and the existence of active users has an important influence on topic propagation. This paper analyzes the user concern model and finds that it can predict the trend of a topic.

In addition, based on the Tweets corpus of the Twitter retrieval task in TREC 2011, we establish a tweet information database. We save the original field information of the corpus and then, according to the hashtags in the tweets, classify the corpus into different classes, which can be used for short text classification. Finally, according to the flow of information between users, we establish an information flow network. Using the database, researchers in related fields can carry out their research efficiently.
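The distinct-word estimate mentioned above draws on Heaps' law, which models the number of distinct words V in a text of n words as V(n) = K * n^beta. The sketch below fits K and beta by least squares in log-log space and then predicts the distinct-word count for a given length. It assumes that "distinct-word length" means the number of distinct words; the function names and the sample data are hypothetical, and the abstract does not state how the thesis combines this estimate with Zipf's law or with the distance penalty.

```python
import math
from typing import List, Tuple


def fit_heaps(samples: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Fit V = K * n**beta from (n, V) pairs via linear regression on logs.

    samples: (total word count, distinct word count) observations, all > 0.
    Returns (K, beta).
    """
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(v) for _, v in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                        # slope of the log-log line
    k = math.exp(mean_y - beta * mean_x)    # intercept, mapped back from logs
    return k, beta


def estimate_distinct_words(n: int, k: float, beta: float) -> float:
    """Heaps' law estimate of the distinct-word count for a text of n words."""
    return k * n ** beta


# Illustrative observations only (not data from the thesis).
observations = [(4, 4), (16, 8), (64, 16), (256, 32)]
k, beta = fit_heaps(observations)
estimate = estimate_distinct_words(100, k, beta)
```

In practice the (n, V) pairs would be measured on a corpus such as the tweet collection described here, and the fitted estimate could then serve as the length term when penalising distances between texts of different sizes.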
Keywords/Search Tags: short text processing, semantic cluster, hot topic mining, trends prediction