Research On Key Technologies Of Deep Web Data Crawling

Posted on: 2019-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
GTID: 2348330569987727
Subject: Information and Communication Engineering
Abstract/Summary:
With the continuous development and renewal of internet technologies, the internet has become inseparable from many aspects of people's lives. Discovering and mining data resources of interest in the vast ocean of internet information has gradually shifted from a topic of technical research to an everyday information retrieval need of ordinary users. Generally, the information resources on the internet can be divided into the Surface Web and the Deep Web according to how difficult the information is to obtain. Surface Web data are mostly embedded in web pages reachable through URL links, so traditional search engines can retrieve them. Deep Web data, however, cannot be obtained through direct indexing. Most of them reside in the back-end databases of websites, and they can be obtained only after the query interface has been found and query submissions have been simulated. Deep Web data are currently known to be hundreds of times larger in volume than Surface Web data and are still growing rapidly. Making full use of Deep Web data is therefore an important way to obtain internet information resources. In this context, this thesis focuses on the query interface problem in Deep Web data crawling, including the discovery of query interfaces and the extraction of query interface schemas. The thesis makes the following main contributions:

(1) This thesis proposes an improved method for Deep Web query interface discovery. First, the problem of positioning Deep Web query interfaces is studied, and a new method based on the visual information of web design is proposed. The method divides web data into areas using the layout and style features of the page, and then locates the interaction interfaces of the page by matching rules based on visual information. This positioning method circumvents the limitations of interface positioning based on <form> tags. Then, for the Deep Web query interface recognition problem, this thesis proposes an improved method that recognizes the query interface by combining the structural and textual features of the interface, which mitigates the low classification accuracy and adaptability caused by a lack of textual features. In experiments, the positioning method for web interaction interfaces achieves very high positioning accuracy, and the improved classification feature set also achieves higher classification performance.

(2) This thesis proposes a three-stage framework for the schema extraction problem of Deep Web query interfaces. Query interface schema extraction is divided into three stages: constructing the element tree of the query interface, attaching labels to tree nodes, and extracting meta-information. Within this framework, we first improve a method for constructing the element tree of a query interface based on recursive hierarchical clustering. The method integrates the latent information of HTML tags with the spatial layout features between interface elements, which remedies the low adaptability caused by not using that latent information. Experiments show that the improved method is more adaptable than the original method and also yields better test results. Then, for the label attaching problem of element tree nodes, a rule-based method is proposed, and seven heuristic rules are summarized and extended to guide label attaching. In experiments, the proposed rule-based label attaching method achieves high matching accuracy.
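
As a rough illustration of what "combining the structural and textual features of the interface" could look like in practice, the Python sketch below extracts a few features from an HTML fragment of an already-located candidate region and applies a toy decision rule. It is not the thesis's method: the class name `InterfaceFeatures`, the functions `extract_features` and `looks_like_query_interface`, the keyword lists, and the decision rule are all assumptions made for this example, and the toy rule merely stands in for whatever trained classifier the thesis uses.

```python
import re
from html.parser import HTMLParser

# Hypothetical keyword lists; the abstract does not publish the textual features.
SEARCH_WORDS = {"search", "find", "query", "keyword"}
NON_QUERY_WORDS = {"login", "password", "register", "subscribe"}
INPUT_TYPES = ("text", "password", "checkbox", "radio", "submit")


class InterfaceFeatures(HTMLParser):
    """Collects simple structural and textual features from an HTML fragment."""

    def __init__(self):
        super().__init__()
        # Structural features: how many controls of each kind appear.
        self.counts = {kind: 0 for kind in INPUT_TYPES}
        self.counts["select"] = 0
        # Textual features: lowercase words from visible text and button captions.
        self.words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self.counts["select"] += 1
        elif tag == "input":
            # An <input> without a type attribute defaults to a text field.
            kind = (attrs.get("type") or "text").lower()
            if kind in INPUT_TYPES:
                self.counts[kind] += 1
            if kind == "submit" and attrs.get("value"):
                self.words.extend(re.findall(r"[a-z]+", attrs["value"].lower()))

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z]+", data.lower()))


def extract_features(html):
    """Return a dict of structural counts plus two keyword-overlap counts."""
    parser = InterfaceFeatures()
    parser.feed(html)
    parser.close()
    words = set(parser.words)
    features = dict(parser.counts)
    features["search_words"] = len(words & SEARCH_WORDS)
    features["non_query_words"] = len(words & NON_QUERY_WORDS)
    return features


def looks_like_query_interface(features):
    """Toy rule (an assumption): both feature families must agree."""
    structural_ok = (
        features["password"] == 0
        and features["text"] + features["select"] >= 1
    )
    textual_ok = features["search_words"] > features["non_query_words"]
    return structural_ok and textual_ok


search_fragment = (
    '<form><label>Keyword: <input type="text" name="q"></label>'
    '<select name="cat"><option>Books</option></select>'
    '<input type="submit" value="Search"></form>'
)
login_fragment = (
    '<form>Username <input type="text" name="u"> '
    'Password <input type="password" name="p">'
    '<input type="submit" value="Login"></form>'
)

print(looks_like_query_interface(extract_features(search_fragment)))
print(looks_like_query_interface(extract_features(login_fragment)))
```

The parser does not require a <form> tag around the controls, so the same feature extraction would also work on a region located by visual rules rather than by tag matching. In a realistic setting the feature dictionary would be fed to a trained classifier instead of the hand-written rule.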
Keywords/Search Tags: Deep Web, query interface positioning, query interface identification, query interface schema extraction