
Deep Web Query Results Extraction And Annotation

Posted on: 2011-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y Xie
Full Text: PDF
GTID: 2178360305455194
Subject: Computer software and theory
Abstract/Summary:
To carry this information, Web databases appeared. The information is stored in Web databases; when users want to find it, they only need to fill in the entry forms of the Web databases, which are also called query interfaces. Web sites that contain Web databases are called the Deep Web. The Deep Web is rich in information, so it attracts more and more research. Current research on the Deep Web mainly covers three parts: Query Interfaces Integration, Query Processing and Query Results Processing. The Query Interfaces Integration part includes four units: Web Databases Discovery, Query Interfaces Schema Extraction, Web Databases Classification and Query Interfaces Integration; the Query Processing part includes two units: Web Databases Selection and Query Transformation; the Query Results Processing part includes three units: Query Results Extraction, Query Results Annotation and Query Results Combination. Together, these three parts form a Deep Web Data Integration System.

This paper focuses on the Query Results Extraction and Query Results Annotation units. Query Results Extraction mines and extracts data records from the returned query result pages; Query Results Annotation adds a semantic label to each data item of those records.

For Query Results Extraction, this paper uses a method based on the HTML tag tree. Data records on the same result page have highly similar structures, and that similarity is reflected in the tag trees that render them, so by converting pages into tag trees, data records can be identified from the similarity of their tag subtrees. After observing a large number of Deep Web result pages and their source code, we summarized the structural characteristics of data records in tag trees and put forward a definition of a data record that differs from previous work. The method has two steps: (1) building the tag tree of a web page; (2) mining data records. In step (1), the page is parsed with HtmlParser and the result is stored in a parse tree whose nodes implement the Node interface defined by HtmlParser. Node has three implementation classes, TagNode, TextNode and RemarkNode, of which TagNode represents the tags in the HTML code. The parse tree is traversed top-down to collect the TagNodes, which become the nodes of the tag tree, while useless style-only nodes are deleted. In step (2), mining data records is a recursive procedure: starting from the root of the tag tree, the procedure takes a node as its argument and checks whether that node is a data record. If it is, the recursion exits, the content of the node is extracted and the call returns; if not, all child nodes of the node are passed to the procedure recursively in turn. The key step in deciding whether a node is a data record is computing the similarity of subtrees of the tag tree, which is done with an edit distance algorithm: the two subtrees are first traversed and flattened into two tag-node sequences, and these sequences are then passed to the edit distance algorithm. Minimal illustrative sketches of the tag-tree construction, the sequence similarity and the recursive mining follow.
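A minimal sketch of step (1), assuming the org.htmlparser (HtmlParser) library named in the abstract. The TagTreeNode class, the list of style-only tags to drop, and the decision to lift their children into the parent are illustrative assumptions, not the thesis's actual code.

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative tag-tree node: one HTML tag plus its tag children. */
class TagTreeNode {
    final String tagName;
    final List<TagTreeNode> children = new ArrayList<>();
    TagTreeNode(String tagName) { this.tagName = tagName; }
}

public class TagTreeBuilder {
    // Assumed set of presentation-only tags discarded while building the tag tree.
    private static final Set<String> STYLE_TAGS =
            new HashSet<>(Arrays.asList("B", "I", "FONT", "EM", "STRONG", "U"));

    /** Parse an HTML string with HtmlParser and keep only the TagNode structure. */
    public static TagTreeNode build(String html) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");
        NodeList parseTree = parser.parse(null);   // full parse tree from HtmlParser
        TagTreeNode root = new TagTreeNode("ROOT");
        addChildren(parseTree, root);
        return root;
    }

    /** Walk the parse tree top-down; keep TagNodes, skip text/remark nodes and style tags. */
    private static void addChildren(NodeList nodes, TagTreeNode parent) {
        if (nodes == null) return;
        for (int i = 0; i < nodes.size(); i++) {
            Node node = nodes.elementAt(i);
            if (node instanceof TagNode) {
                String name = ((TagNode) node).getTagName();
                if (STYLE_TAGS.contains(name)) {
                    // Drop the style tag itself but lift its children into the parent.
                    addChildren(node.getChildren(), parent);
                } else {
                    TagTreeNode child = new TagTreeNode(name);
                    parent.children.add(child);
                    addChildren(node.getChildren(), child);
                }
            }
            // TextNode and RemarkNode are ignored when building the tag tree.
        }
    }
}
```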
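The subtree comparison reduces to a classic edit distance over flattened tag sequences. The sketch below reuses TagTreeNode from the previous block; the preorder flattening and the normalization of the distance into a [0, 1] similarity are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helpers: flatten a tag subtree to a tag-name sequence and compare two sequences. */
public class TagSequence {

    /** Preorder traversal of a subtree into a sequence of tag names (assumed flattening order). */
    public static List<String> flatten(TagTreeNode node) {
        List<String> sequence = new ArrayList<>();
        collect(node, sequence);
        return sequence;
    }

    private static void collect(TagTreeNode node, List<String> out) {
        out.add(node.tagName);
        for (TagTreeNode child : node.children) {
            collect(child, out);
        }
    }

    /** Classic Levenshtein edit distance between two tag-name sequences. */
    public static int editDistance(List<String> a, List<String> b) {
        int[][] d = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) d[i][0] = i;
        for (int j = 0; j <= b.size(); j++) d[0][j] = j;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.size()][b.size()];
    }

    /** Normalize the distance to a similarity in [0, 1]; the exact normalization is an assumption. */
    public static double similarity(List<String> a, List<String> b) {
        int longer = Math.max(a.size(), b.size());
        if (longer == 0) return 1.0;
        return 1.0 - (double) editDistance(a, b) / longer;
    }
}
```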
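A minimal sketch of step (2), building on the two blocks above. The record test used here (a node whose tag sequence is highly similar to a sibling's, with 0.9 assumed for the threshold S) is only a placeholder for the thesis's own definition of a data record and its thresholds H and L.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative recursive search of the tag tree for data records. */
public class RecordMiner {
    private static final double SIMILARITY_THRESHOLD = 0.9;   // assumed value of threshold S
    private final List<TagTreeNode> records = new ArrayList<>();

    public List<TagTreeNode> mine(TagTreeNode root) {
        visit(root, null);
        return records;
    }

    private void visit(TagTreeNode node, TagTreeNode parent) {
        if (isDataRecord(node, parent)) {
            // Recursion exit: extract the record and stop descending into this subtree.
            records.add(node);
            return;
        }
        // Otherwise call the procedure on each child node in turn.
        for (TagTreeNode child : node.children) {
            visit(child, node);
        }
    }

    /** Placeholder check: the node's tag sequence closely matches at least one sibling's. */
    private boolean isDataRecord(TagTreeNode node, TagTreeNode parent) {
        if (parent == null) {
            return false;
        }
        List<String> self = TagSequence.flatten(node);
        for (TagTreeNode sibling : parent.children) {
            if (sibling != node
                    && TagSequence.similarity(self, TagSequence.flatten(sibling)) >= SIMILARITY_THRESHOLD) {
                return true;
            }
        }
        return false;
    }
}
```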
As can be seen from the above, the biggest difference between our method and many other tag-tree-based methods is that it does not need to mine data regions first; it mines data records directly.

For Query Results Annotation, this paper adds a semantic label to each data item by using an ontology combined with heuristic rules. The Query Results Annotation unit contains an Ontology Management Module and a Semantic Annotation Module; illustrative sketches of both follow the experimental results below. In the Ontology Management Module, concepts consisting of main-concepts and sub-concepts are extracted according to the characteristics of many Deep Web query interface schemas, an ontology library is established, and a sub-concept table and a candidate-concept table are maintained in the Ontology Manager so that the ontology can be modified automatically. The sub-concept table stores all sub-concepts of each main-concept in the ontology library and ensures the consistency of ontology concepts; the candidate-concept table stores concepts that the system cannot match. These concepts must be confirmed by domain experts as either domain-relevant or domain-irrelevant: domain-relevant concepts are put into the ontology library as main-concepts and the sub-concept table is updated; domain-irrelevant concepts are simply discarded. By enriching the sub-concept table and the candidate-concept table, the ontology's ability to distinguish domain semantics is also enhanced.

In the Semantic Annotation Module, the data records extracted by the Query Results Extraction unit are pre-processed first: each data item is standardized, text content is kept and image content is removed. Then, for each data item to be labeled, the module first determines whether it is a semantic-based data item or a content-type data item. For a semantic-based data item, the main-concepts in the ontology library and their sub-concepts are matched against its describing text; on a successful match, the main-concept of the matching concept replaces the describing text as the label of the data item, and on a failed match, the describing text is put into the candidate-concept table and the text following it is treated as a content-type data item. For a content-type data item, the ontology instances in the ontology library are matched against it; on a successful match, the main-concept of the matching instance is used as the describing text to label the data item, and on a failed match, heuristic rules are applied.

The Query Results Extraction experiment uses the result pages returned by many book-domain Deep Web sites as a test set for tuning the thresholds. The tests show that when the thresholds H, L and S are set to 2, 10 and 0.9, the F-measure reaches its maximum of 96.8%. Under this threshold setting, a further experiment on the result pages of several Chinese book-domain Deep Web sites yields a precision of 100% and a recall of 98.2%, an excellent result. The data records extracted in this experiment are then used in the Query Results Annotation experiment, in which precision and recall are 98.1% and 90.4% and the F-measure is 94.1%; this meets the requirements of real applications but still leaves room for improvement.
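A minimal sketch of the ontology maintenance described above. The class and method names (OntologyManager, matchMainConcept, confirmCandidate) are illustrative assumptions; only the two tables and their roles come from the abstract.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative ontology manager: a sub-concept table mapping each main-concept to its
 * sub-concepts, and a candidate-concept table of unmatched concepts awaiting expert judgment.
 */
public class OntologyManager {
    private final Map<String, Set<String>> subConceptTable = new HashMap<>();
    private final Set<String> candidateConceptTable = new HashSet<>();

    /** Look up the main-concept for a describing text, or null if the ontology cannot match it. */
    public String matchMainConcept(String text) {
        for (Map.Entry<String, Set<String>> entry : subConceptTable.entrySet()) {
            if (entry.getKey().equalsIgnoreCase(text) || entry.getValue().contains(text)) {
                return entry.getKey();
            }
        }
        candidateConceptTable.add(text);   // unmatched: queue for expert confirmation
        return null;
    }

    /** Expert confirmation: promote a domain-relevant candidate to a main-concept, or discard it. */
    public void confirmCandidate(String concept, boolean domainRelevant) {
        candidateConceptTable.remove(concept);
        if (domainRelevant) {
            subConceptTable.putIfAbsent(concept, new HashSet<>());
        }
        // Domain-irrelevant candidates are simply given up.
    }

    /** Register a sub-concept (e.g. "writer" under "author") to keep concept usage consistent. */
    public void addSubConcept(String mainConcept, String subConcept) {
        subConceptTable.computeIfAbsent(mainConcept, k -> new HashSet<>()).add(subConcept);
    }
}
```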
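A minimal sketch of the annotation flow for a single data item, reusing the OntologyManager sketch above. The instance lookup and the sample heuristic rule are placeholder assumptions; the abstract does not spell out the actual heuristic rules.

```java
/** Illustrative annotation flow for one data item, following the decision order described above. */
public class Annotator {
    private final OntologyManager ontology;

    public Annotator(OntologyManager ontology) {
        this.ontology = ontology;
    }

    /** Label a semantic-based item: match its describing text against main- and sub-concepts. */
    public String annotateSemanticItem(String describingText) {
        String mainConcept = ontology.matchMainConcept(describingText);
        // On success the main-concept becomes the label; on failure the describing text has been
        // queued as a candidate concept and the caller treats the following text as content-type.
        return mainConcept;
    }

    /** Label a content-type item by matching ontology instances, falling back to heuristics. */
    public String annotateContentItem(String content) {
        String mainConcept = matchInstance(content);
        if (mainConcept != null) {
            return mainConcept;
        }
        return applyHeuristicRules(content);
    }

    /** Placeholder: match the content against instances stored in the ontology library. */
    private String matchInstance(String content) {
        return null;   // assumed lookup against instance data, omitted here
    }

    /** Placeholder for the heuristic rules (e.g. a price-like string contains a currency symbol). */
    private String applyHeuristicRules(String content) {
        if (content.matches(".*[$¥€]\\s*\\d+(\\.\\d+)?.*")) {
            return "price";    // illustrative rule only
        }
        return "unknown";
    }
}
```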
Keywords/Search Tags: Deep Web, extraction, annotation, tag tree, Ontology, heuristic rules