
Research And Implementation Of Search Engine Prototype Based On Deep Web Crawler

Posted on: 2011-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: K Tan
Full Text: PDF
GTID: 2178330332985827
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, simple, extensible, platform-independent web technology has become prevalent, and dynamic web pages are gradually replacing static ones. Given the characteristics of dynamic pages, the emergence of the deep web has become an inevitable trend. Search engines give users an effective way to collect information from the web. However, traditional search engines mainly present unstructured data, and the structured data of the deep web cannot be reached through them. As deep web research has deepened, retrieving deep web data through search engines has become a new task in the field.

This thesis constructs a search engine prototype based on a deep web crawler and the core framework of Lucene. In implementing the prototype, we mainly study deep web search, the identification of deep web query interfaces, deep web surfacing preprocessing, and the selection of inputs for associated form query templates. Query interface identification is based on the DOM tree.

For deep web surfacing preprocessing, this thesis proposes an algorithm for selecting associated form query templates. The algorithm models the form input values and analyzes the form page query process. A weighting technique is used to select the inputs that fill the query form templates. Finally, the corresponding back-end database query links are obtained, and through them the deep web data.

For the search engine architecture, this thesis uses the open-source Lucene search engine framework, which offers two core classes: the core index class and the core search class. The crawler fetches the data content and saves it into the index of the Lucene repository. The core search class then provides the query interface to users, which completes the search engine prototype based on the deep web crawler.
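
The DOM-tree-based check for query interfaces mentioned in the abstract might be sketched as follows. This is only an illustration in Python using the standard-library `html.parser`. The abstract does not state the thesis's actual rules, so the names `FormCollector`, `is_query_interface`, and `find_query_interfaces`, and the heuristic itself (reject forms with a password field, accept forms with at least one text-like or select control), are assumptions made for this example.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collects, for each <form>, the input controls nested inside it."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per <form> element seen
        self._current = None   # the form currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "method": (attrs.get("method") or "get").lower(),
                "controls": [],
            }
            self.forms.append(self._current)
        elif self._current is not None and tag in ("input", "select", "textarea"):
            # For <input>, record its type; otherwise record the tag name.
            kind = (attrs.get("type") or "text").lower() if tag == "input" else tag
            self._current["controls"].append(kind)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def is_query_interface(form):
    """Assumed heuristic, not the thesis's rule set."""
    controls = form["controls"]
    if "password" in controls:
        return False  # likely a login or registration form
    searchable = [c for c in controls if c in ("text", "search", "select")]
    return len(searchable) >= 1


def find_query_interfaces(html):
    """Return the forms in `html` that the heuristic accepts."""
    collector = FormCollector()
    collector.feed(html)
    return [f for f in collector.forms if is_query_interface(f)]
```

In a crawler, a function like `find_query_interfaces` would be applied to each fetched page, and only the accepted forms would be passed on to the input-selection step.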
Keywords/Search Tags: search engine, deep web, form, DOM tree, Lucene