Research On The Deep Web Search Interface Identification And Extraction Technology

Posted on: 2012-05-23
Degree: Master
Type: Thesis
Country: China
Candidate: L Yang
Full Text: PDF
GTID: 2248330395955407
Subject: Computer software and theory
Abstract/Summary:
Accessing Web databases has gradually become the main means of searching for information. How to automatically retrieve information from Web databases has become a hot topic within the database research community. Research on search interface identification and extraction plays an important role in Deep Web data integration systems.

Search interface identification aims to distinguish search interfaces from other forms in web pages. The development of dynamic web page techniques, especially the emergence of the scripting language JavaScript, has had a significant impact on how forms are presented and submitted. This thesis uses the Rhino engine to analyze the JavaScript code in an HTML form. Building on previous research, it also designs and implements a search interface identification method based on the maximum entropy model. Experimental results show that the accuracy of search interface identification is higher than 95%.

The inherent difficulty of search interface extraction is matching form controls with the text labels that express their semantic information. Search interfaces are divided into four types according to their structure, and a method for matching attributes is given for each type based on its structural characteristics. The extraction and matching of search interface attributes is then implemented on the basis of the DOM. Building on this, the thesis proposes an improved method: search interface extraction based on a path index. Experimental results show that the F-measure of search interface extraction reaches 94% or above.
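To make the control-to-label matching problem concrete, the following is a minimal sketch in Python using only the standard library's `html.parser`. It is not the thesis's method: it implements neither the four structural types nor the path index, and simply pairs each visible control with an explicit `<label for=...>` when one exists, falling back to the nearest preceding text. The names `FormLabelMatcher` and `match_labels` are hypothetical and were chosen for illustration.

```python
from html.parser import HTMLParser


class FormLabelMatcher(HTMLParser):
    """Illustrative heuristic: pair form controls with nearby label text."""

    CONTROL_TAGS = {"input", "select", "textarea"}
    # Input types that carry no user-facing query attribute.
    SKIPPED_INPUT_TYPES = {"hidden", "submit", "reset", "button", "image"}

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.skip_text = False         # True inside <select>/<textarea>
        self.pending_text = ""         # latest visible text not yet used
        self.current_label_for = None  # "for" target of the open <label>
        self.explicit_labels = {}      # control id -> <label for=...> text
        self.controls = []             # (name, id, preceding text)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif not self.in_form:
            return
        elif tag == "label":
            self.current_label_for = attrs.get("for")
        elif tag in self.CONTROL_TAGS:
            input_type = (attrs.get("type") or "text").lower()
            if tag == "input" and input_type in self.SKIPPED_INPUT_TYPES:
                return
            self.controls.append(
                (attrs.get("name"), attrs.get("id"), self.pending_text)
            )
            self.pending_text = ""
            if tag in ("select", "textarea"):
                self.skip_text = True  # option/default text is not a label

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False
        elif tag == "label":
            self.current_label_for = None
        elif tag in ("select", "textarea"):
            self.skip_text = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.in_form or self.skip_text:
            return
        self.pending_text = text
        if self.current_label_for:
            self.explicit_labels[self.current_label_for] = text


def match_labels(html):
    """Return {control name: label text} for the controls found in forms."""
    parser = FormLabelMatcher()
    parser.feed(html)
    parser.close()
    matched = {}
    for name, control_id, nearby_text in parser.controls:
        # Prefer an explicit <label for=...>; otherwise use preceding text.
        label = parser.explicit_labels.get(control_id) or nearby_text
        matched[name] = label.rstrip(":").strip()
    return matched
```

A heuristic of this kind works only when the label precedes its control in document order. Layouts that place labels in a separate table row or after the control would defeat it, which is presumably why the thesis treats different interface structures separately.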
Keywords/Search Tags:Deep Web Search Interface, Search Interface Identification, Search Interface Extraction