The Research Of Web Query Interface Location And Schema Extraction

Posted on: 2019-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: J T Ou
GTID: 2428330545474863
Subject: Computer software and theory
Abstract/Summary:
Query interface location and schema extraction are an important part of query interface integration and an effective way to automate the search of Web databases. They have theoretical and practical significance for large-scale, real-time, and diversified Deep Web information retrieval. Existing query interface location techniques rely on discovering form tags and therefore cannot locate non-Form interfaces. Based on an analysis of the basic types and features of query interfaces, this thesis proposes two interface location algorithms to address that problem. Existing schema extraction methods rely on text semantic similarity and have low precision, so a schema extraction method that uses the structure and position features of the query interface is also proposed. Finally, a system for automatic identification and schema extraction of Web query interfaces is designed and implemented. The main innovations and contributions are as follows:

(1) To address the missed detections and misjudgments of existing location methods, which cannot locate non-Form interfaces, a frequent common tree path interface location algorithm is proposed based on a study of the DOM tree structure of query interfaces. Because the controls of the same query interface share a common starting node, the interface block is obtained from the common tree path between the controls and from region correlation, and a C4.5 classifier then quickly determines whether the block is a query interface. Experimental results show that the algorithm's average precision on ticket interfaces reaches 93.39%, which is 48.21%, 58.45%, and 31.58% higher than that of three existing location algorithms. However, when a page contains multiple query interfaces, the algorithm generates local interface blocks.

(2) To address the interface misjudgments caused by the redundant interface blocks that the frequent common tree path algorithm generates, a spatial hierarchical clustering interface location algorithm is proposed based on a study of the spatial characteristics of interface nodes. Because the nodes of the same query interface are close to each other in the DOM tree and their tree paths are highly similar, an improved Euclidean distance is used to refine the location result, and a C4.5 classifier then quickly determines whether the result is a query interface. Experimental results show that the algorithm's average recall and precision on ticket query interfaces are 95.71% and 96.07%, respectively, which are 1.77% and 2.32% higher than those of the frequent common tree path location algorithm.

(3) Existing query interface schema expression models cannot reflect the constraint relationships among elements. To address this defect, an improved ticket query interface schema expression model is proposed based on an analysis of the characteristics of ticket query interfaces. Existing schema extraction methods use text similarity to extract attributes and have low precision. Because the elements of the same attribute are close to each other both in the visual layout of the Web page and in the DOM tree, and follow certain combination rules, the page position and mixing distance constraints of the elements are analyzed together, and a double-constraint attribute extraction algorithm is proposed. Experimental results show that the average precision of the attribute extraction algorithm on auto ticket query interfaces reaches 93.01%, which is 81.06% higher than that of the DOM-based two-phase cluster attribute extraction algorithm.

(4) Building on this theoretical research into automatic query interface identification and schema extraction, a query interface automatic recognition and schema extraction system is designed and implemented.
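The common tree path idea in contribution (1) can be illustrated with a small sketch. This is not the thesis's algorithm: the representation of a control as a tuple of root-to-node path steps, the function names `common_path` and `most_frequent_common_path`, and the sample paths are all assumptions made for illustration, and the region correlation and C4.5 classification steps are omitted. The sketch only shows how controls that share a common starting node can be grouped by the path prefix they have in common.

```python
from collections import Counter
from itertools import combinations


def common_path(a, b):
    """Return the longest shared prefix of two root-to-node DOM paths."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return tuple(shared)


def most_frequent_common_path(control_paths):
    """Pick the common path shared by the most pairs of controls.

    Ties are broken in favour of the deeper (longer) path, on the
    assumption that a deeper shared node encloses the controls more tightly.
    """
    counts = Counter(
        common_path(a, b) for a, b in combinations(control_paths, 2)
    )
    best_path, _ = max(counts.items(), key=lambda kv: (kv[1], len(kv[0])))
    return best_path


# Hypothetical page: four controls in a search block, one stray input in a footer.
controls = [
    ("html", "body", "div#search", "input[1]"),
    ("html", "body", "div#search", "input[2]"),
    ("html", "body", "div#search", "select"),
    ("html", "body", "div#search", "button"),
    ("html", "body", "div#footer", "input"),
]

block_root = most_frequent_common_path(controls)
print(block_root)
```

In this sample, the six pairs inside the search block share the path ending at `div#search`, while the four pairs involving the footer input share only `html/body`, so the search block's root is chosen as the candidate interface block. Note that no `form` tag is needed for this to work, which is the property the abstract emphasizes.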
Keywords/Search Tags: web database, query interface, location and recognition, schema extraction, system design