
Research On Chinese Blog Information Gathering And Opinion Retrieval

Posted on: 2010-07-14
Degree: Master
Type: Thesis
Country: China
Candidate: T He
Full Text: PDF
GTID: 2178360302959840
Subject: Information security
Abstract/Summary:
With the development of Internet information technology, the standing of online media keeps rising, and its influence has reached politics, the economy, and everyday life, and public opinion in particular. Because public-opinion information can be produced and spread more quickly on the Internet, online public opinion receives a great deal of attention, and how to access and analyze it effectively has become a research hotspot.

Web content security, which studies security on the basis of web content, is an important part of the field of online public-opinion research. It involves text classification, text clustering, topic detection and tracking, and related techniques. Among these, clustering is a major technology for realizing web content security. The information used in online public-opinion research comes from BBS posts, blogs, and web pages; it differs from traditional long text and can be called Chinese network short text. Because Chinese network short texts contain few keywords and many anomalous words, traditional text clustering methods are not suitable for direct use on them. This thesis therefore presents a clustering approach for Chinese network short text based on immune network regulation. First, Chinese n-gram chunks are extracted and converted to Pinyin to form the features of a short text, which mitigates the adverse effect of these two characteristics on clustering performance. Then the short-text set is modeled as a dynamic network, and an immune network learning mechanism is used to learn the similarity among short texts in order to obtain a better clustering result.

Blogs are a very important source of information for online public opinion, and research on them has become a hotspot. Blog data are updated frequently, so gathering real-time information effectively is a basic step in any research on blogs.
This thesis focuses on technology for gathering real-time blog information and implements a gathering system based on collecting blogs' feed pages. Using this system, a large collection of Chinese blog data was gathered, which can be used in research on Chinese blogs.

The data in the blog space are mostly personal opinions and views and are highly subjective. The blog space therefore contains much public-opinion information, and analyzing and studying blogs helps in researching online public opinion more effectively. Against this background, this thesis studies opinion retrieval for Chinese blogs. It proposes a multiple-strategies ranking approach based on a full analysis of the characteristics that distinguish blog data from other Internet data. The strategies include a content relevance retrieval ranking score, an opinion ranking score, a time characteristics score, and a comments characteristics score. The thesis also applies the clustering method for Chinese network short text and adds an immune-based clustering score. The multiple-strategies ranking method substantially improves the performance of blog opinion retrieval.
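The abstract names the five scores used for ranking but does not say how they are combined. As a rough illustration only, the sketch below assumes a simple weighted sum. The weights, the `BlogPost` class, and the function names are all hypothetical and are not taken from the thesis, and each score is assumed to be normalized to the range 0 to 1 beforehand.

```python
from dataclasses import dataclass


@dataclass
class BlogPost:
    """One retrieved blog post with its five per-strategy scores (assumed in [0, 1])."""
    post_id: str
    relevance: float  # content relevance retrieval ranking score
    opinion: float    # opinion ranking score
    time: float       # time characteristics score
    comments: float   # comments characteristics score
    cluster: float    # immune-based clustering score


# Hypothetical weights chosen for illustration; the thesis abstract gives none.
WEIGHTS = {
    "relevance": 0.4,
    "opinion": 0.3,
    "time": 0.1,
    "comments": 0.1,
    "cluster": 0.1,
}


def combined_score(post, weights=WEIGHTS):
    """Weighted sum of the per-strategy scores (an assumed combination rule)."""
    return sum(w * getattr(post, name) for name, w in weights.items())


def rank_posts(posts, weights=WEIGHTS):
    """Return posts sorted from highest to lowest combined score."""
    return sorted(posts, key=lambda p: combined_score(p, weights), reverse=True)


if __name__ == "__main__":
    posts = [
        BlogPost("a", relevance=0.9, opinion=0.1, time=0.5, comments=0.5, cluster=0.5),
        BlogPost("b", relevance=0.6, opinion=0.9, time=0.5, comments=0.5, cluster=0.5),
    ]
    for post in rank_posts(posts):
        print(post.post_id, round(combined_score(post), 3))
```

With these illustrative weights, a strongly opinionated post can outrank a slightly more relevant but neutral one, which is the kind of effect an opinion-retrieval ranking is meant to produce. In practice the weights would have to be tuned on labeled data.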
Keywords/Search Tags: blog, information gathering, opinion, network short-text, clustering, multiple-strategies ranking