Research On Topic Detection In Blogosphere Based On Content Analysis

Posted on: 2011-11-28
Degree: Master
Type: Thesis
Country: China
Candidate: J Y He
Full Text: PDF
GTID: 2178360332458124
Subject: Computer Science and Technology
Abstract/Summary:
Topic detection identifies previously unknown topics in a text-oriented information stream and is an important component of topic detection and tracking technology. It seeks out events that occurred at a particular time and place, together with the related topics that extend from them, and it has great practical value in information extraction and public opinion monitoring. At present, the most common topic detection algorithms are designed for news website corpora, while algorithms for the Blogosphere are not yet mature. The reason is that the Blogosphere is a personal medium: its corpus is more complex than news and much larger in volume.

This thesis analyzes the structure of Blogosphere data in depth and identifies the main requirements of topic detection on blog data. It designs a topic model based on the characteristics of blog data, with the topic center and the keyword set as its main features. The topic detection algorithm, the keyword extraction algorithm, and the special topic extraction algorithm are all based on this topic model. The main contributions are as follows:

1. A topic model is designed based on the characteristics of blog data. It contains five features: topic name, keyword set, topic center, posts of the topic, and time of the topic. All algorithms in this thesis are based on this model: the topic detection algorithm and the keyword extraction algorithm produce the features of the model, and the special topic extraction algorithm operates on it.

2. Various types of text clustering algorithms are analyzed, and incremental clustering is chosen as the main component of the topic detection algorithm. Three optimization strategies are introduced: topic center update, text filtering, and selection of topic models. Experiments demonstrate the efficiency of the topic detection algorithm.

3. The topic keyword extraction algorithm extracts keywords for each topic. The words contained in each topic are weighted by the mutual information formula, and a word that appears in a title is treated as more important for describing the topic.

4. The special topic extraction algorithm is based on the topic model. It uses three features of the model: keyword set, topic center, and time of the topic. Three different formulas are designed to calculate the similarity between topic models. Experiments demonstrate the efficiency of the special topic extraction algorithm.

Based on the above work, this thesis designs a topic detection system for the Blogosphere. The system consists of five modules: database, data preprocessing, topic detection, topic feature extraction, and special topic extraction. The system serves as a foundation for topic detection research in the Blogosphere.
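As an illustration of the incremental clustering with topic center update named in contribution 2, the following is a minimal single-pass sketch in Python. It is not the thesis's algorithm, which the abstract does not specify. The cosine similarity, the raw term counts, the summed-count center update, the threshold value, and all names (`cosine`, `detect_topics`, `threshold`) are assumptions made for illustration. The text filtering and topic model selection strategies are not shown.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dict-like)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_topics(posts, threshold=0.3):
    """Single-pass incremental clustering (illustrative assumption).

    posts: iterable of token lists, in arrival order.
    Returns a list of topics; each topic is a dict holding a running
    term-count "center" and the indices of its member "posts".
    """
    topics = []
    for idx, tokens in enumerate(posts):
        vec = Counter(tokens)
        # Find the most similar existing topic center.
        best, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(vec, topic["center"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            # Topic center update: fold the new post's counts into the center.
            best["center"].update(vec)
            best["posts"].append(idx)
        else:
            # No center is close enough, so the post starts a new topic.
            topics.append({"center": Counter(vec), "posts": [idx]})
    return topics


if __name__ == "__main__":
    # Hypothetical, already-tokenized posts.
    sample = [
        ["earthquake", "rescue", "city"],
        ["earthquake", "rescue", "volunteers"],
        ["football", "final", "goal"],
        ["football", "goal", "coach"],
    ]
    print([t["posts"] for t in detect_topics(sample)])  # [[0, 1], [2, 3]]
```

Because cosine similarity ignores vector length, keeping the center as a sum of term counts gives the same similarities as keeping a mean. Each post is compared against existing centers only once, which is why this style of clustering suits a continuously growing stream of posts.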
Keywords/Search Tags: Blogosphere, topic detection, topic model, special topic extract