An Approach Based On WSFT Model For Crawling Deep Web

Posted on: 2017-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: H X Li
Full Text: PDF
GTID: 2348330488976104
Subject: Software engineering
Abstract/Summary:
With the rise of Web 2.0, content on the Internet now falls into two categories: the Surface web and the Deep web. The former consists of pages that traditional search engines can reach, made up of hyperlinks and static pages; the latter consists of resources that exist on the Internet but cannot be reached through hyperlinks. At present, the amount of accessible information in the Deep web far exceeds that of the Surface web. Studying methods for accessing Deep web content is therefore significant for improving search engine coverage.
Among the technologies used in the Deep web, Ajax has become an important component because it provides more fluent interaction. An Ajax page differs from an ordinary Surface web page in that it is a mixture of multiple states: one page corresponds to multiple document structures, and these document structures are strongly associated with one another. The multiple states and the strong association among them may benefit data processing tasks such as mining important content. However, current studies offer no data pre-processing method that targets this characteristic of Ajax pages.
Text is the main carrier of information, and most web mining methods analyze text. Meanwhile, both the content information and the structure information of a page document are very important for web mining. This thesis proposes a Deep web text access method based on WSFT (weighted state fusion tree), built on the following idea: across the transitions among the multiple states of an Ajax page, the more frequently a text block appears, the more important it is likely to be.
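The frequency idea stated above can be illustrated with a minimal sketch. It assumes each state of a page has already been reduced to a list of text blocks; the function name `text_block_frequency` and the sample data are hypothetical and do not come from the thesis.

```python
from collections import Counter


def text_block_frequency(states):
    """Count in how many states each text block appears.

    `states` is a list of states; each state is a list of text blocks.
    This representation is an assumption made for illustration.
    """
    counts = Counter()
    for blocks in states:
        # Use a set so a block repeated inside one state counts once.
        counts.update(set(blocks))
    return counts


# Three made-up states of a single Ajax page.
states = [
    ["Product name", "Description tab", "Price"],
    ["Product name", "Reviews tab", "Price"],
    ["Product name", "Shipping tab"],
]

freq = text_block_frequency(states)
# Most frequent blocks first; ties are broken alphabetically.
ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
```

This plain count treats every state as equally important. The method described in the abstract goes further and weights the states themselves before merging them.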
The method mainly analyzes and processes the text information of Ajax pages, and preserves both the content information and the structure information of the web page well.
First, a text feature tree is constructed. This is a data structure that effectively preserves the document content and the structural features of the original web page. It can serve as the information fingerprint of a state, so that state changes can be detected effectively, which optimizes the collection of Ajax page information in the Deep web.
Then, the text feature tree is introduced into the collection procedure. In the concrete implementation, event agent technology is used to actively trigger the various states of a web page; each state is converted into a text feature tree, yielding the set of text feature trees for the page; the transitions among states are used to construct a directed state transition graph; and the adjacency matrix of this directed graph is computed.
Finally, the StatusRank algorithm computes the weight of each state, and all states are merged into a WSFT, which provides valuable structured data for subsequent web mining (content mining and structure mining).
Several websites that use Ajax technology were chosen for experiments. A corresponding prototype system, cl-fetcher, was designed and implemented, and analysis of the experimental results demonstrates that the proposed method is effective.
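The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the cl-fetcher implementation: the tuple form of the text feature tree, the SHA-1 fingerprint, and every function name are hypothetical. The abstract does not define StatusRank, so a PageRank-style power iteration stands in for it, and the fusion step is simplified to a flat map from text block to accumulated weight instead of a tree.

```python
import hashlib


def fingerprint(node):
    """Hash a text feature tree given as (tag, text, children)."""
    tag, text, children = node
    h = hashlib.sha1()
    h.update(f"{tag}|{text}".encode("utf-8"))
    for child in children:
        h.update(fingerprint(child).encode("ascii"))
    return h.hexdigest()


def adjacency_matrix(n, edges):
    """Build the adjacency matrix of the state transition graph."""
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[src][dst] = 1
    return matrix


def rank_states(matrix, damping=0.85, iterations=50):
    """PageRank-style weights; a stand-in for StatusRank (assumption)."""
    n = len(matrix)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i, row in enumerate(matrix):
            out = sum(row)
            if out == 0:
                # A state with no outgoing transition spreads its rank evenly.
                for j in range(n):
                    new[j] += damping * rank[i] / n
            else:
                for j, linked in enumerate(row):
                    if linked:
                        new[j] += damping * rank[i] / out
        rank = new
    return rank


def texts(node):
    """Yield every non-empty text block in a text feature tree."""
    _tag, text, children = node
    if text:
        yield text
    for child in children:
        yield from texts(child)


def fuse(trees, weights):
    """Give each text block the summed weight of the states containing it."""
    fused = {}
    for tree, weight in zip(trees, weights):
        for block in set(texts(tree)):
            fused[block] = fused.get(block, 0.0) + weight
    return fused


# Made-up states observed while triggering events on one page.
s0 = ("body", "", [("h1", "News", []), ("p", "Story A", [])])
s1 = ("body", "", [("h1", "News", []), ("p", "Story B", [])])
s0_again = ("body", "", [("h1", "News", []), ("p", "Story A", [])])
s2 = ("body", "", [("h1", "News", []), ("p", "Story C", [])])

# Equal fingerprints mean the event did not produce a new state.
unique = {}
for tree in [s0, s1, s0_again, s2]:
    unique.setdefault(fingerprint(tree), tree)
trees = list(unique.values())

# Hypothetical transitions: state 0 leads to 1 and 2, and both lead back.
edges = [(0, 1), (0, 2), (1, 0), (2, 0)]
weights = rank_states(adjacency_matrix(len(trees), edges))
fused = fuse(trees, weights)
```

In this toy graph state 0 receives the largest weight because both other states lead back to it, so "Story A" ends up weighted above "Story B" and "Story C", while "News", present in every state, accumulates the total weight.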
Keywords/Search Tags:Ajax crawler, weighted state fusion tree, Text mining, text feature tree