The Focused Web Crawling Strategy Based On Incremental Learning

Posted on: 2011-04-04  Degree: Master  Type: Thesis
Country: China  Candidate: Z Q Gao
GTID: 2178360308970906  Subject: Computer software and theory
Abstract/Summary:
Since the Internet appeared, search engines have become the main way of obtaining information. However, with the rapid development of the Internet and the growing demands of its users, traditional search engines such as AltaVista, Google, and Yahoo have shown some limitations, and topic-specific search engines have become one of the most active research directions for addressing them. In focused crawling, the crawler fetches web pages on a given topic to meet a specific demand. Its advantage is that it retrieves pages relevant to the topic while traversing as few irrelevant pages as possible, which narrows the collection range and improves resource utilization.

Web pages on the Internet change constantly: over time, some pages appear and others disappear. Based on an in-depth study of the principles and characteristics of topic search and of this constant change, this thesis applies the idea of incrementally building concept lattices to topic search based on formal concept analysis, and puts forward a focused web crawling strategy based on incremental learning, which gives the focused crawler a certain ability to learn.

The main research work of the dissertation is summarized as follows:

1) It applies the idea of incremental learning to focused crawling. Given the feasibility of formal concept analysis in topic search, the thesis uses the Concept Context Graph (CCG), derived from the concept lattice, as background knowledge to guide the focused crawler, an approach it presents as an innovation. To reflect changes in web pages, the CCG must be updated in a timely manner. Updating the CCG is the process of incremental learning: topic-relevant concepts are added to the CCG and topic-irrelevant concepts are deleted from it.
2) It updates the CCG by adding topic-relevant concepts. First, it selects the topic-relevant pages from the search results. Then it derives the Incremental Concepts (IC) from these pages using the algorithm proposed in the thesis and adds them to the CCG.
3) It updates the CCG by deleting topic-irrelevant concepts. It identifies pages that are not relevant to the topic and deletes from the CCG the concepts derived from those pages.
4) Experiments show that the proposed crawling strategy is feasible. The thesis reports the results before and after updating the CCG and compares them with two other crawling strategies, CLCG and CSCG. Finally, it analyzes the accuracy of the results and demonstrates the superiority of the method.
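
The add and delete updates described in the abstract can be illustrated with a small sketch. It is not the algorithm of the thesis, which the abstract does not give. It assumes that each page is reduced to a set of terms, that a concept is identified by its intent (a closed set of terms) in the sense of formal concept analysis, and that a plain set of intents stands in for the CCG. The function names `concept_intents`, `add_relevant_pages`, and `delete_irrelevant_pages` are hypothetical, and the sketch recomputes the concepts after each change instead of updating a lattice in place.

```python
def concept_intents(context):
    """Return the intents of all formal concepts with a non-empty extent.

    `context` maps a page id to the set of terms found on that page
    (an assumed representation, not taken from the thesis).
    """
    closed = set()
    for terms in context.values():
        terms = frozenset(terms)
        # Each page contributes its own term set plus its intersection
        # with every intent found so far.
        closed |= {terms} | {terms & intent for intent in closed}
    return closed


def add_relevant_pages(context, new_pages):
    """Add topic-relevant pages; return the new context and the added intents."""
    before = concept_intents(context)
    updated = {**context, **new_pages}
    incremental = concept_intents(updated) - before
    return updated, incremental


def delete_irrelevant_pages(context, page_ids):
    """Drop topic-irrelevant pages; return the new context and the lost intents."""
    before = concept_intents(context)
    updated = {p: t for p, t in context.items() if p not in page_ids}
    removed = before - concept_intents(updated)
    return updated, removed


# Toy data: page ids and terms are made up for illustration.
pages = {
    "p1": {"crawler", "topic"},
    "p2": {"crawler", "lattice"},
}

pages, incremental = add_relevant_pages(
    pages, {"p3": {"crawler", "topic", "focused"}}
)
print("incremental concepts:", incremental)

pages, removed = delete_irrelevant_pages(pages, {"p3"})
print("removed concepts:", removed)
```

In this toy run, adding `p3` introduces one new intent, and deleting `p3` removes it again, so the set of concepts returns to its initial state. Deciding which pages count as topic-relevant is left outside the sketch.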
Keywords/Search Tags: Search Engine, Focused Web Crawler, Formal Concept Analysis, Concept Context Graph