
Research Of Blog Similarity Analysis Based On Hidden Markov Model

Posted on: 2013-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Sang
Full Text: PDF
GTID: 2248330377958812
Subject: Computer technology
Abstract/Summary:
With the continuous development of Internet technology, more and more Blog web pages with similar structure and identical content are appearing. These pages occupy a large amount of network storage and reduce the accuracy of Blog retrieval for users, so Blog similarity needs to be analyzed in order to improve retrieval efficiency. For this reason, the Hidden Markov Model (HMM) is applied to Blog webpage recognition.

This paper first analyzes algorithms for text similarity and web similarity, and introduces the analysis and identification of both web pages in general and Blog pages in particular. It proposes using the DocView model to define a webpage, uses semi-structured data to identify webpage content, and filters web pages with a text similarity algorithm. Secondly, it analyzes the structural characteristics of Blog web pages, defines Blog web pages on that basis, and puts forward a method for coarse classification and detailed classification of Blog web pages.

This paper then studies the characteristics of the HMM and concludes that it can be applied to Blog webpage identification. The three basic problems of an HMM are evaluation, decoding and learning (or training). They are addressed by the forward-backward algorithm, the Viterbi algorithm and the Baum-Welch algorithm, respectively. Two HMMs are then constructed: one based on Blog webpage structure similarity and one based on Blog webpage similarity. Once both models are constructed, the HMM training process begins, covering the training set, isolated words, feature extraction, machine learning, memory, and prediction on the prediction set. A threshold is then defined and the model is tested.

The results obtained in this paper are:
1. In the HMM design process, the general calculation of Blog webpage similarity is improved. Recalculating the frequency of keywords improves the accuracy of Blog webpage recognition.
2. The HMM, which has rarely been used to calculate Blog webpage structure similarity, is applied to that task through a purpose-built algorithm and greatly improves recognition accuracy.
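The evaluation and decoding problems named in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the two hidden states, the three observation symbols, all probabilities, and the helper names `forward`, `viterbi` and `is_similar` are hypothetical, chosen only to show how a tag sequence could be scored against a trained HMM and compared with a threshold. Baum-Welch training is omitted, and the length-normalized log-likelihood test is an assumption about how such a threshold might be applied.

```python
import math

def forward(obs, start, trans, emit):
    """Evaluation problem: P(obs | model), computed with the forward pass."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)

def viterbi(obs, start, trans, emit):
    """Decoding problem: most likely hidden state path and its probability."""
    n = len(start)
    delta = [start[s] * emit[s][obs[0]] for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev = delta
        ptr = [max(range(n), key=lambda p: prev[p] * trans[p][s])
               for s in range(n)]
        delta = [prev[ptr[s]] * trans[ptr[s]][s] * emit[s][o] for s in range(n)]
        back.append(ptr)
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]

def is_similar(obs, start, trans, emit, threshold):
    """Hypothetical test: per-symbol log-likelihood must reach the threshold."""
    return math.log(forward(obs, start, trans, emit)) / len(obs) >= threshold

# Hypothetical toy model: state 0 = page header, state 1 = post body.
# Observation symbols: 0 = "div", 1 = "p", 2 = "a" (illustrative tags only).
start = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1],
        [0.1, 0.3, 0.6]]
obs = [0, 1, 2]

likelihood = forward(obs, start, trans, emit)
path, path_prob = viterbi(obs, start, trans, emit)
print(path)  # [0, 0, 1]
print(is_similar(obs, start, trans, emit, threshold=-1.5))
```

In a setting like the one the abstract describes, the parameters would come from Baum-Welch training on known Blog pages rather than being written by hand, and the threshold would be chosen from the model test.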
Keywords/Search Tags: Structure Similarity, Keyword Similarity, HMM