
Research And Implementation On Building Subtopics For Chinese Microblog

Posted on: 2014-05-05
Degree: Master
Type: Thesis
Country: China
Candidate: R H Chen
GTID: 2308330479479214
Subject: Software engineering
Abstract/Summary:
As a new medium of information exchange, microblogs report on events in almost every field, including society, politics, the economy and culture. Microblog platforms generate a massive amount of information at every moment, so users cannot read all microblogs in a short time to find out what they are about. Topics have therefore become an important way of organizing microblogs. Grouping microblogs on the same subject into a topic alleviates the information overload to some extent, but a single topic still contains a huge number of microblogs. In fact, a topic usually contains multiple closely related subtopics, so how to organize the subtopics under a topic has become an important problem. This thesis studies TDT (topic detection and tracking) techniques in greater depth and builds subtopics using the content information and theme vectors of microblogs, while also considering location, participants and other factors that benefit subtopic construction. In addition, it proposes a novel algorithm called Label Rank, based on a random walk model, to extract subtopic tags. The main work and innovations are as follows:

(1) In microblog data preprocessing, this thesis first filters out meaningless microblogs, then uses regular expressions to clean the data, for example by removing URL links and "@user" mentions, and finally converts Traditional Chinese text to Simplified Chinese based on Wikipedia's Simplified-Traditional correspondence table.

(2) This thesis proposes a novel method for building subtopics based on microblog content and theme vectors. The method takes into account location, participants and other factors that benefit subtopic construction: it constructs a participant vector, a location vector, a keyword vector and a theme vector for each microblog, then calculates the similarity between microblogs from these vectors. Experiments show that this approach effectively builds subtopics under a topic: compared with using theme vectors only, the F1 value improved by 4.2% and the DET value dropped by 8.6%. The approach also performs well on subtopics whose subjects are too similar to be easily distinguished.

(3) This thesis presents a novel method called Label Rank for labeling subtopics. The algorithm builds a weighted word co-occurrence graph based on LDA, uses a random walk model to compute the significance of keywords, and then selects the Top-K keywords as subtopic labels. Experiments show that the Label Rank algorithm can effectively extract subtopic labels.

(4) Within the framework of the YHPODS public opinion monitoring system, this thesis designs and implements a regional subtopic system. The system stores data using the distributed cluster storage mechanism of Cassandra together with the relational database Oracle. Based on a library of Chinese place names, this thesis designs and implements an algorithm for gathering the microblogs of regional topics, and also implements an algorithm for mapping IP addresses to geographical locations and the method of building subtopics based on content and themes.
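The preprocessing described in contribution (1) might look roughly like the following Python sketch. The abstract does not give the regular expressions, the filtering criterion or the conversion table, so the patterns, the `min_chars` threshold and the three-character `T2S` mapping are all illustrative assumptions rather than the thesis's actual rules.

```python
import re

# Assumed patterns: the abstract only says URLs and "@user" mentions are removed.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&_%#-]+")
MENTION_RE = re.compile(r"@[\w-]+")

# Tiny hypothetical excerpt of a Traditional-to-Simplified table; the thesis
# uses Wikipedia's correspondence table, which would be loaded from a file.
T2S = str.maketrans({"話": "话", "題": "题", "國": "国"})


def clean_microblog(text):
    """Remove URLs and mentions, convert to Simplified Chinese, tidy spaces."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = text.translate(T2S)
    return " ".join(text.split())


def is_meaningful(text, min_chars=2):
    """Assumed filter: a microblog is kept only if some text survives cleaning."""
    return len(clean_microblog(text)) >= min_chars
```

For instance, `clean_microblog("@user 話題 http://t.cn/abc 中國")` yields `"话题 中国"`, while a post consisting only of a mention and a link is rejected by `is_meaningful`.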
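As an illustration of contribution (2), one plausible way to combine the four vectors is a weighted sum of per-vector cosine similarities. The abstract does not state how the vectors are combined, so the weighted-sum form, the `WEIGHTS` values and the sparse-dictionary representation below are assumptions made for this sketch only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical weights; the thesis does not report the values it uses.
WEIGHTS = {"participant": 0.2, "location": 0.2, "keyword": 0.3, "theme": 0.3}


def microblog_similarity(a, b, weights=WEIGHTS):
    """Weighted sum of cosine similarities over the four vector types."""
    return sum(
        w * cosine(a.get(name, {}), b.get(name, {}))
        for name, w in weights.items()
    )
```

Two microblogs that share participants and location but differ in keywords and theme would score at most 0.4 under these assumed weights, which is how the extra vectors can separate subtopics whose text is otherwise similar.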
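The random-walk ranking behind Label Rank in contribution (3) can be approximated by a generic weighted PageRank-style iteration over a word co-occurrence graph. The abstract does not specify how LDA determines the edge weights or which walk parameters are used, so this sketch takes positive edge weights as given input, and `damping` and `iterations` are assumed values; it is not the thesis's exact algorithm.

```python
def rank_keywords(edges, damping=0.85, iterations=50):
    """Score words by a damped random walk on an undirected weighted graph.

    edges maps (word_a, word_b) pairs to positive co-occurrence weights.
    """
    neighbors = {}
    for (a, b), w in edges.items():
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w

    n = len(neighbors)
    score = {word: 1.0 / n for word in neighbors}
    # Total edge weight leaving each word, used to normalize transitions.
    out_weight = {word: sum(nb.values()) for word, nb in neighbors.items()}

    for _ in range(iterations):
        new_score = {}
        for word, nb in neighbors.items():
            incoming = sum(score[o] * w / out_weight[o] for o, w in nb.items())
            new_score[word] = (1.0 - damping) / n + damping * incoming
        score = new_score
    return score


def top_k_labels(edges, k=3):
    """Return the k highest-scoring words as candidate subtopic labels."""
    score = rank_keywords(edges)
    return sorted(score, key=score.get, reverse=True)[:k]
```

On a small graph where "flood" co-occurs with every other word, the walk concentrates on "flood" first and then on its most strongly connected neighbor, which matches the intuition of picking Top-K central keywords as labels.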
Keywords/Search Tags: Subtopic, Microblog, LDA, Topic Label, Vector Model, Local Topics