The Strategy Of Topic-specific Web Crawler Based On Semantics Similarity

Posted on: 2011-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: Q Q Peng
GTID: 2178360308470996
Subject: Computer software and theory
Abstract/Summary:
In recent years, with the rapid development of the Internet and Web technology, the number of web pages has been growing at an exponential rate. Because digital information is now abundant, varied, and quickly updated, Web data mining has become the principal way to access it, and search engines have become an important tool for obtaining information. The problem is that a traditional search engine has its crawler download web pages exhaustively, which is unrealistic because it cannot keep up with the rapid growth of web information. At the same time, several features of web information make data mining by search engines difficult: the amount of data is massive, the information is complex, the Internet is inherently dynamic, users have diverse characteristics, and so on. Topic-specific search engines have therefore emerged and attracted a great deal of attention from researchers. The crawler of a topic-specific search engine is called a Topic Oriented Crawler. It downloads from the Internet only the web pages related to a specific theme in order to answer users' queries. The advantages of this method are less time spent, smaller storage space, and the ability to meet users' individual needs. Furthermore, by automatically identifying theme information, it can quickly choose the zones related to the theme to crawl and download useful web pages. In this way, the crawler effectively avoids crawling unrelated zones and prepares a rich data resource for topic-oriented user queries. However, given the complexity of web structure and the real-time requirements of a topic-oriented crawler, several questions arise. How can the crawler's ability to identify the theme be improved? How can more theme-related web pages be downloaded in less time? How can unrelated web pages be avoided while the pages that interest the user are still retrieved?
All of these problems concern the topic-specific crawling strategy; they remain to be solved and have become a hot research topic. Studying several existing topic-specific crawling strategies, we find that they rank predicted URLs on the basis of keywords. That is, they do not predict the crawling direction by analyzing semantics. In this thesis, we predict the crawling direction of a topic-specific crawler by mining the content and link information of the web pages already downloaded. Using Formal Concept Analysis, we first perform cluster analysis on the text content and then predict the crawling direction by calculating the semantic similarity of concepts in the concept lattice, so that the prediction of the topic-specific crawling direction is made at the semantic level. The contributions of the dissertation are summarized as follows:
(1) We introduce the concept lattice into semantic similarity calculation. We construct a concept lattice as the user's context information from the theme-related web pages that have been downloaded, and then map the concept lattice into a concept context graph. We calculate the concept similarity between web pages and the concept context graph to predict the priority of the URLs to be crawled.
(2) We propose a novel method for constructing the concept context graph. There are many traditional construction approaches. For example, Diligenti proposed the link context graph (LCG) in [14], which is based entirely on the link relations between web pages, and Ching-Chi Hsu proposed the relevancy context graph (RCG) in [15], which adds a similarity calculation to the link relations between web pages. Our approach is based on the concept lattice: it maps every concept into the context graph according to the relationships among the attributes of the concepts, and thereby forms the concept context graph.
(3) We propose a topic-specific crawling strategy based on the analysis of text content and link information. Through the concept context graph we calculate the semantic similarity of web pages and ensure that the downloaded web pages are closer to the theme. Combining this with the link information of web pages to guide the crawl ensures that the crawler chooses the correct crawling direction and passes through theme-unrelated zones to reach theme-related zones.
(4) We obtain web data by constructing a topic-specific search system. We evaluate the performance of the search strategy by the number of theme-related documents retrieved and by recall and precision. Empirical results indicate that the strategy proposed in this thesis produces significant improvements over the breadth-first search strategy and other popular search strategies on the same datasets.
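The idea of ranking candidate URLs by their similarity to concepts mined from already downloaded pages can be sketched as follows. This is only an illustrative sketch, not the method of the thesis: the abstract does not give its similarity formula, so Jaccard overlap between term sets is used here as a stand-in, and the names `concept_intents`, `url_priority`, and `Frontier`, the example pages, and the example URLs are all hypothetical. Each candidate URL is assumed to come with a small set of descriptive terms (for instance, taken from its anchor text).

```python
import heapq
from itertools import count


def concept_intents(context):
    """Return the intents (attribute sets) of all formal concepts.

    `context` maps each downloaded page to its set of terms. The intents of
    a formal context are exactly the intersections of the pages' term sets,
    together with the full term set.
    """
    all_terms = frozenset().union(*context.values())
    intents = {all_terms}
    for terms in context.values():
        row = frozenset(terms)
        # Intersect the new page with every intent found so far.
        intents |= {row & intent for intent in intents}
    return intents


def jaccard(a, b):
    """Jaccard overlap of two sets (assumed stand-in for concept similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def url_priority(candidate_terms, intents):
    """Score a candidate URL by its best match against any non-empty intent."""
    terms = frozenset(candidate_terms)
    return max((jaccard(terms, i) for i in intents if i), default=0.0)


class Frontier:
    """Crawl frontier that always returns the highest-scoring URL first."""

    def __init__(self):
        self._heap = []
        self._tie = count()  # keeps insertion order for equal scores

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]


# Hypothetical theme-related pages that have already been downloaded.
context = {
    "p1": {"crawler", "topic", "search"},
    "p2": {"crawler", "topic", "lattice"},
    "p3": {"crawler", "semantic", "lattice"},
}
intents = concept_intents(context)

frontier = Frontier()
candidates = {
    "http://example.org/a": {"crawler", "topic"},
    "http://example.org/b": {"football", "scores"},
    "http://example.org/c": {"lattice", "semantic"},
}
for url, terms in candidates.items():
    frontier.push(url, url_priority(terms, intents))
```

Popping from `frontier` then yields the candidates in decreasing order of similarity to the mined concepts, so pages that share no terms with any concept are visited last. A fuller treatment would also use the ordering of concepts in the lattice and the link structure between pages, which this sketch leaves out.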
Keywords/Search Tags: search engine, topic-specific crawler, concept context graph, semantic analysis, link analysis