
The Strategy Of Topic-specific Web Crawler Based On Semantics Similarity

Posted on: 2010-05-19    Degree: Master    Type: Thesis
Country: China    Candidate: Y K Yang    Full Text: PDF
GTID: 2178360275499912    Subject: Computer application technology
Abstract/Summary:
With the Internet growing exponentially, the volume of digital information has become enormous, its forms are diverse, and it is updated very quickly. According to statistics from Forrester Research, roughly one to two exabytes (10^18 bytes) of digital information are produced every year. Data mining has become the main way of finding relevant information on the web, and the search engine is regarded as the most important tool for retrieving web resources. With the number of web sites and documents growing ever faster and site contents being updated more and more often, large-scale search engines cannot grow at the same pace, so they cover an ever-decreasing portion of the web. Google has indexed about 8 billion web pages, roughly one five-hundredth of the whole web, which grows by about 60 TB (1 TB = 10^12 bytes) of new pages every day.

Because of this problem, the topic-specific (focused) search engine has become a hot research topic: it crawls only the web pages that belong to a given subject in order to meet a specific demand. It has several advantages, such as shorter crawling time, smaller storage requirements, and better support for users' personalized needs. Above all, the crawling strategy is the most important part of a focused search engine. In the literature, the ordering of unvisited URLs has been studied in depth, with prediction scores calculated from the unvisited URLs' ancestor pages. However, two problems remain. First, all the URLs in a web page are given the same score, without regard to the surrounding text; in other words, the page is assumed to have only one topic. Second, the prediction scores of unvisited URLs are not calculated from semantic similarity.

To solve these problems, we present a semantics-based crawling strategy that consists of two parts. First, a concept similarity context graph is proposed on the basis of formal concept analysis, in which the semantic similarity between concepts can be calculated to determine the order of the next crawl and, at the same time, to find the concepts that best reflect the intent of the user's query. Second, web pages are parsed into DOM-tree structures, and by combining text semantic similarity with the page's hierarchical structure, the URLs in different paragraphs are given different prediction scores.

The main research work of this dissertation is summarized as follows:

(1) A concept similarity context graph is presented based on formal concept analysis (FCA); the similarities between concepts and the core concept are calculated to predict the scores of the URLs to be crawled next. The main idea is to build a formal context from the crawled web pages, generate a concept lattice from it, and then calculate the similarities between concepts to build the concept similarity context graph. The main difference from earlier work is that the layers of the graph are generated not only from the link structure of the URLs but also from the semantic similarities between concepts. Finally, the prediction scores of the unvisited URLs, which determine the crawling order, are calculated according to the layer of the corresponding concept in the graph. A minimal sketch of this scoring idea is given below.
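
The following is a small, hypothetical sketch of layer-based URL scoring with lattice concepts. The Concept class, the attribute-overlap similarity, and the damping by layer are illustrative assumptions standing in for the dissertation's concept lattice construction and concept similarity measure, not the actual algorithm.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Concept:
        extent: frozenset  # crawled pages that share the attributes below
        intent: frozenset  # attributes (terms) shared by those pages

    def concept_similarity(a: Concept, b: Concept) -> float:
        # Illustrative similarity: overlap of the concepts' attribute sets.
        if not a.intent or not b.intent:
            return 0.0
        return len(a.intent & b.intent) / len(a.intent | b.intent)

    def url_priority(concept: Concept, core: Concept, layer: int) -> float:
        # Prediction score: semantic similarity to the core (query) concept,
        # damped by the concept's layer in the context graph.
        return concept_similarity(concept, core) / (1 + layer)

    core = Concept(frozenset({"p1", "p2"}), frozenset({"crawler", "semantic", "topic"}))
    candidates = {
        "http://example.org/a": (Concept(frozenset({"p3"}), frozenset({"crawler", "topic"})), 1),
        "http://example.org/b": (Concept(frozenset({"p4"}), frozenset({"sport", "news"})), 2),
    }
    frontier = sorted(candidates,
                      key=lambda u: url_priority(candidates[u][0], core, candidates[u][1]),
                      reverse=True)
    print(frontier)  # the URL whose concept shares topical terms is crawled first

In the full method the layers come from both the link structure and the concept similarities; here the layer of each candidate is simply given.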
(2) We present a similarity calculation that combines edit distance with the vector space model (VSM); a small sketch of the combination is given after this summary. In the traditional method of calculating text similarity, the terms in a paragraph are treated as independent and their order is ignored; in other words, the same terms arranged in different orders can express different meanings. We calculate text similarity with the VSM, which regards a query or paragraph as a vector of independent terms, and then apply edit distance over the term sequences to compensate for the lost ordering information. The combination of the two methods achieves better results than either alone.

(3) A method for predicting the crawling order of unvisited URLs is introduced that uses the internal hierarchical structure of a web page; a sketch also follows this summary. Our inspiration comes from metadata extraction and from the observation that a single web page may contain several topics. First, we divide a web page into several parts so that each part has only one topic; the hierarchical structure then induces relations between the paragraphs. Web pages are parsed into DOM trees, rules are derived from the internal characteristics of the hierarchical structure, and related paragraphs are connected using the combined VSM and edit-distance similarity. Finally, the scores of the unvisited URLs are calculated from the scores of the paragraphs in which they appear.

Experiments were carried out, and the accuracy of the results shows that the strategy proposed in this dissertation outperforms several other crawling strategies, which demonstrates the superiority of our model and algorithms.
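
A minimal sketch of the combined similarity from point (2), assuming simple whitespace tokenisation: cosine similarity over bag-of-words vectors is blended with a normalised term-level edit distance so that word order is taken into account. The weighting parameter alpha is a hypothetical choice for illustration, not a value taken from the dissertation.

    import math
    from collections import Counter

    def cosine_sim(a, b):
        # Bag-of-words cosine similarity: term order is ignored.
        va, vb = Counter(a), Counter(b)
        dot = sum(va[t] * vb[t] for t in va)
        norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
        return dot / norm if norm else 0.0

    def edit_sim(a, b):
        # 1 - normalised Levenshtein distance over term sequences: order matters.
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + 1,
                              d[i][j - 1] + 1,
                              d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return 1 - d[m][n] / max(m, n, 1)

    def combined_sim(x, y, alpha=0.5):
        a, b = x.lower().split(), y.lower().split()
        return alpha * cosine_sim(a, b) + (1 - alpha) * edit_sim(a, b)

    # Same terms in a different order: the VSM part scores them as identical,
    # while the edit-distance part tells them apart.
    print(combined_sim("topic specific web crawler", "crawler web specific topic"))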
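
And a rough sketch of the paragraph-level link scoring from point (3), assuming Python's standard html.parser and a split on <p> tags as the only block boundary: every anchor inherits the topical score of the paragraph that contains it rather than a single page-wide score. The topic_sim placeholder stands in for the combined VSM and edit-distance similarity above; the class and its rules are illustrative, not the dissertation's actual DOM rules.

    from html.parser import HTMLParser

    def topic_sim(text, topic):
        # Placeholder similarity: fraction of topic terms present in the paragraph.
        terms = topic.lower().split()
        return sum(t in text.lower() for t in terms) / len(terms)

    class ParagraphLinkScorer(HTMLParser):
        def __init__(self, topic):
            super().__init__()
            self.topic = topic
            self.text, self.links = [], []  # state of the current paragraph
            self.scores = {}                # url -> predicted crawling score

        def handle_starttag(self, tag, attrs):
            if tag == "p":
                self.text, self.links = [], []
            elif tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

        def handle_data(self, data):
            self.text.append(data)

        def handle_endtag(self, tag):
            if tag == "p":
                # Score the paragraph once and pass it on to its links.
                score = topic_sim(" ".join(self.text), self.topic)
                for url in self.links:
                    self.scores[url] = max(score, self.scores.get(url, 0.0))

    page = ('<p>Focused crawling strategies <a href="/crawl">read more</a></p>'
            '<p>Football results of the week <a href="/sport">scores</a></p>')
    scorer = ParagraphLinkScorer("focused crawling")
    scorer.feed(page)
    print(scorer.scores)  # the link in the on-topic paragraph gets the higher score

In the full strategy these paragraph scores would also propagate along the DOM hierarchy and feed the concept similarity context graph; only the per-paragraph assignment is shown here.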
Keywords/Search Tags:search engine, topic-specific spider, formal concept analysis, Dom-Tree, edit distance