
Research Of Focused Crawler Based On Semantic Disambiguation Hidden Markov Model

Posted on: 2021-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: H M Gan
Full Text: PDF
GTID: 2428330647460799
Subject: Engineering
Abstract/Summary:
A focused crawler is a program that automatically fetches web pages related to a given topic from the Internet. Most focused crawlers predict the topic similarity of each unvisited URL from the page content and the link structure, and use that similarity as the URL's visit priority. However, such crawlers do not account for the polysemy of lexical terms when representing web pages, so they cannot accurately determine the representation terms, and hence the topic similarity, of a fetched page. This misleads the crawler's choice of crawling direction and reduces its crawling effectiveness. In addition, these crawlers do not cluster pages that share the same link distance, so they cannot accurately estimate the state probability that a fetched page is linked to a target page, which likewise misleads the crawling direction and degrades performance.

To address these problems, this thesis proposes a focused crawler based on a Semantic Disambiguation Graph and a Hidden Markov Model. The main contributions are as follows:

(1) This thesis constructs the Semantic Disambiguation Graph (SDG). The SDG is used to remove ambiguous terms unrelated to the given topic from the representation terms of a fetched web page, so that those terms can be determined more accurately. The SDG is built by extracting topic terms from a training set of web pages and taking these terms as the nodes of the graph. The relationship strength between two nodes is measured by the number of co-occurring and relevant web pages on the Internet for the corresponding pair of terms. Through a fuzzy inference model, the ambiguous terms are identified among the topic terms corresponding to the nodes of the graph, and the
disambiguation term set for each ambiguous term is then extracted from the relationship strengths between the nodes. The goal of the SDG is to remove ambiguous terms that are unrelated to the given topic from the representation term set of a fetched page and thereby refine that set.

(2) This thesis establishes the Hidden Markov Model (HMM). The HMM is used to estimate the state probability that a fetched web page is linked to a target page, in order to predict the priorities of the unvisited URLs it contains. The model takes the link distance from each page to the target page as its hidden state, and the cluster assigned to each page by content-based clustering as its observation. The model's parameters comprise the initial state probability distribution, the state transition probability matrix, and the observation emission probability matrix; all three are estimated from the numbers of training pages falling into each hidden state and each observation cluster. Given a sequence of observation clusters and the model parameters, the HMM estimates the state probability of a fetched page being linked to the target page, from which the priorities of the unvisited URLs on that page are inferred.
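To make the SDG's edge weights concrete, here is a minimal Python sketch of one plausible co-occurrence measure. The abstract does not give the exact formula, so the normalization below (pages containing both terms divided by pages containing either, a Jaccard-style ratio) is an assumption, and `pages`, `edge_strength`, and `build_sdg` are illustrative names standing in for the thesis's actual corpus and procedure.

```python
from itertools import combinations

def edge_strength(pages, t1, t2):
    """Assumed Jaccard-style measure of relationship strength:
    pages where both terms co-occur, over pages containing either term."""
    both = sum(1 for terms in pages if t1 in terms and t2 in terms)
    either = sum(1 for terms in pages if t1 in terms or t2 in terms)
    return both / either if either else 0.0

def build_sdg(pages, topic_terms):
    """Weighted edge list over topic-term nodes (hypothetical structure);
    each page is represented as a set of its terms."""
    return {
        (t1, t2): edge_strength(pages, t1, t2)
        for t1, t2 in combinations(sorted(topic_terms), 2)
    }
```

For example, with pages represented as term sets such as `{"apple", "fruit"}`, the graph maps each term pair to a strength in [0, 1]; low-strength neighbors of an ambiguous term would be candidates for pruning.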
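The HMM step can be sketched similarly. The abstract does not specify the inference procedure, so the sketch below assumes the standard forward algorithm: hidden states are link distances to the target page, observations are content-cluster IDs, and a page's URL priority is taken as the posterior probability of the small-distance states. The parameter values in the usage note are illustrative only, not estimates from the thesis.

```python
def forward_posterior(obs, pi, A, B):
    """Forward algorithm: posterior over hidden states (link distances)
    after observing the sequence of content-cluster IDs `obs`.
    pi: initial state distribution; A: state transition matrix;
    B: observation emission matrix (rows = states, cols = clusters)."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * A[p][s] for p in range(n)) * B[s][o]
                 for s in range(n)]
    z = sum(alpha)
    return [a / z for a in alpha]

def url_priority(obs, pi, A, B, near=1):
    """Assumed scoring rule: priority = probability that the current
    page is within `near` links of a target page."""
    post = forward_posterior(obs, pi, A, B)
    return sum(post[:near + 1])
```

With, say, three states (link distance 0, 1, 2) and two content clusters, `url_priority` would rank a crawl-frontier URL higher when the observation sequence along its path makes the small-distance states more probable.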
Keywords/Search Tags:Focused Crawler, Semantic Disambiguation Graph, Hidden Markov Model, Fuzzy Inference, Probability Model