Research On Key Techniques Of Topic-Oriented Blog Resource Mining

Posted on: 2012-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: W F Xuan
Full Text: PDF
GTID: 2218330362450449
Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the Web 2.0 era, the barrier for users to publish information on the Internet has become lower, and the volume of information on the Internet has become extremely large. As a typical Web 2.0 application, the blog is attracting more and more users because it is simple and convenient to use. In the blogosphere, which is composed of blogs, users can record ideas according to their interests, read other people's articles, and comment on them. As a result, the blogosphere holds a huge amount of information about topics (or interests). Under such circumstances, it is very difficult for users to find what they want in this mass of data, so mining valuable information for users from massive blog data is becoming increasingly important. To this end, this thesis studied three problems, and the main research contents are as follows:

First, after analyzing the limitations of existing keyword extraction algorithms, namely their dependence on external resources and on specific text formats, this thesis proposed a blog post keyword extraction algorithm based on the topic model Latent Dirichlet Allocation (LDA). The effectiveness of the algorithm was then validated through comparative experiments with TFIDF and the Hierarchical Hidden Markov Model (HHMM). Finally, the thesis analyzed the reason for the superiority of the proposed algorithm in terms of the linear correlation between the weights of keywords and their frequencies, measured with the Pearson product-moment correlation coefficient.

Second, through a comparative analysis of four typical clustering algorithms (K-means, K-means++, Affinity Propagation and Markov Cluster), this thesis selected the Markov Cluster algorithm as the one more suitable for the specific application. Based on it, a three-layer algorithm was designed for the thematic clustering of blog posts and the automatic generation of descriptions of the clustering results. The effectiveness and stability of the designed algorithm were then validated through comparative experiments.

Third, drawing on specific characteristics of the blogosphere, namely comments and the phenomenon of reproduction, this thesis proposed a blog ranking algorithm that integrates comments and text similarity, and validated its effectiveness and stability through comparative experiments on two real blog datasets. Results showed that, under the Normalized Discounted Cumulative Gain (NDCG) indicator, the proposed algorithm achieved performance gains of 17% and 29%, respectively, compared with traditional link analysis methods, and had good stability as well.
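For readers unfamiliar with the NDCG indicator mentioned in the abstract, the following is a minimal sketch of how it is commonly computed for one ranked list. The abstract does not say which DCG variant or cutoff the thesis used, so the linear-gain form with a log2 discount and the function names `dcg` and `ndcg` are assumptions made for illustration only.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of graded relevance scores in ranked order.

    Assumed variant: linear gain, log2(rank + 1) discount, ranks starting at 1.
    """
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """Normalized DCG: DCG of the given ranking divided by the DCG of the
    ideal (descending-relevance) ranking, optionally truncated at cutoff k."""
    scores = list(relevances)
    ranked = scores if k is None else scores[:k]
    ideal = sorted(scores, reverse=True)[:len(ranked)]
    ideal_dcg = dcg(ideal)
    # If no item is relevant, the ideal DCG is zero; return 0.0 by convention.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical relevance grades for the blogs returned by two rankers.
    baseline_ranking = [1, 3, 0, 2]
    proposed_ranking = [3, 2, 1, 0]
    print("baseline NDCG:", ndcg(baseline_ranking))
    print("proposed NDCG:", ndcg(proposed_ranking))
```

A ranking that already lists items in descending order of relevance scores 1.0, and any other ordering of the same items scores lower, which is what makes the measure suitable for comparing ranking algorithms on the same dataset.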
Keywords/Search Tags: blog resource mining, topic model, keyword extraction, cluster analysis, preferential PageRank