
Analysis Of Deep Web Query Interface Discovery And Schema Extraction Technology

Posted on: 2013-01-06; Degree: Master; Type: Thesis
Country: China; Candidate: X He; Full Text: PDF
GTID: 2248330362974825; Subject: Computer technology
Abstract/Summary:
With the development of the Internet, web applications have become part of people's everyday lives, and search engines are among the most widely used of them. The web data that traditional search engines cannot reach is called the Deep Web, and mining Deep Web data is an active topic in research on web data management and integration.

Current Deep Web research focuses mainly on two problems: deep integration of query interfaces and schema extraction from query results. However, practical integration systems are not yet mature. Because query interfaces are the basis of such a system, it is important to discover, classify, and process Deep Web query interfaces correctly and efficiently. The main work of this thesis is an automatic technique for discovering Deep Web query interfaces and extracting their features, which is applied to efficient deep integration of query interfaces.

1. The tag property, visual property, and hierarchy property.
By analyzing a large number of Deep Web pages, we identify three properties of such pages: the tag property, the visual property, and the hierarchy property.
1) For the tag property, we analyze the HTML documents of Deep Web pages and convert the tag structure into a tree structure, which is easier to process by computer.
2) For the visual property, we analyze the layout of Deep Web pages and convert the visual properties of each tag into a visual block; the visual blocks together form the whole page.
3) For the hierarchy property, we combine the visual features that people perceive when browsing a page with the tree structure of the page's tags, and construct visual layers that correspond to the levels of the tag tree. A page is then the superposition of its visual layers.

2. Discovery of Deep Web query interfaces based on hierarchy.
Combining the tag property, visual property, and hierarchy property of Deep Web pages, we propose a hierarchy-based method for discovering Deep Web query interfaces. The method first constructs the page's tag tree by analyzing its overall tag structure, and constructs visual blocks by analyzing the visual features of the tags. Each level of the tag tree is then converted into a visual layer according to the visual blocks. The visual layers are used to analyze the tag and visual properties of the query interface and the characteristics of the core page area that the interface occupies. Finally, whether a region is a query interface is decided by computing the degree of polymerization of its control tags.

3. Feature extraction of query interfaces based on latent domain.
A Deep Web query interface is composed of several controls and phrases. In this thesis, we convert the query interface into plain text and treat feature extraction as a text-processing problem. We propose a feature extraction method based on latent domains. Based on the observation that every phrase has a certain probability of belonging to different topics and domains, the method clusters the query interface text, determines the latent domain of the text, and finally extracts the words related to the domain of the Deep Web source as the features of the query interface.
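
The tag-tree and control-tag steps of the discovery method can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not define the degree of polymerization, so the sketch assumes a simple stand-in, the share of control tags among all elements in a form's subtree. The names `Node`, `TagTreeBuilder`, `control_aggregation`, and `candidate_interfaces`, and the thresholds `min_controls` and `min_score`, are hypothetical. Visual blocks and visual layers are not modeled.

```python
from html.parser import HTMLParser

# Assumption: these tags count as "control tags"; the abstract gives no list.
CONTROL_TAGS = {"input", "select", "textarea", "button"}
# Elements with no closing tag, so the builder must not descend into them.
VOID_TAGS = {"input", "br", "img", "hr", "meta", "link"}


class Node:
    """One element of the page tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Converts the tag structure of an HTML document into a tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def subtree_counts(node):
    """Return (control tags, all elements) in the subtree rooted at node."""
    controls = 1 if node.tag in CONTROL_TAGS else 0
    total = 1
    for child in node.children:
        c, t = subtree_counts(child)
        controls += c
        total += t
    return controls, total


def control_aggregation(node):
    """Assumed stand-in score: share of control tags in the subtree."""
    controls, total = subtree_counts(node)
    return controls / total


def candidate_interfaces(root, min_controls=2, min_score=0.5):
    """Return form nodes whose control tags are numerous and concentrated."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.tag == "form":
            controls, total = subtree_counts(node)
            if controls >= min_controls and controls / total >= min_score:
                found.append(node)
        stack.extend(node.children)
    return found
```

Calling `candidate_interfaces(build_tag_tree(page_html))` on a page's HTML string returns the form subtrees that pass both thresholds. A fuller version would also weigh the visual position and size of each block, as the method describes.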
Keywords/Search Tags: Deep Web page, tag trees, polymerization degree, hierarchy, latent domain