
Deep Web Data Extraction And Annotation Based On Ontology Evolution

Posted on: 2012-04-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K R Chen
Full Text: PDF
GTID: 1118330335453002
Subject: Computer application technology
Abstract/Summary:
The Web can generally be divided into two main categories, the Surface Web and the Deep Web, according to how information is stored and distributed. The Surface Web refers to pages that use hyperlinks to link resources such as photographs and documents, and that can be reached only by following those hyperlinks. The Deep Web, by contrast, stores information in a back-end database and exposes only a query interface to users; it then automatically generates pages containing the query results, based on the query conditions that users submit. Because traditional search engines rely mainly on hyperlinks for crawling, they cannot retrieve the abundant information held in Deep Web sites.

Meanwhile, information extracted and integrated from Deep Web sites has many applications. For instance, a price comparison service covering products from multiple e-commerce sites can help merchants understand the market and bring material benefits to consumers; it can also help portals provide more professional and personalized information search services. Data extraction and integration for the Deep Web will therefore not only produce substantial economic benefits but also improve the recall and precision of traditional search engines.

An ontology can be regarded as a special kind of shared dictionary with a customizable data structure, suitable for describing the concepts, and the relationships between concepts, of a specific domain in a computer system. In addition, using an ontology to extract and annotate data removes the reliance on web page structure found in traditional approaches.
Thus, this paper proposes a data extraction and annotation method based on ontology evolution. The research covers four main areas:

(1) Based on the structured nature of the data in Deep Web query result pages, this paper designs a fairly simple 7-tuple flow ontology attribute model that effectively describes the attributes and attribute relationships in a domain ontology. Ontology construction takes into account both the information in the query interface and the instance information in the query results, so a richer ontology can be built. The attribute information in the ontology is the union of the query attributes in query interface pages and the instance information in query result pages.

(2) Data extraction from query result pages has three phases: identifying data record areas, segmenting data records, and aligning data records. When a web page contains multiple data records, those that contain a lot of ontology information are likely to form the data area made up of query result records. Based on this observation, this paper proposes a maximum relevance subtree algorithm to identify query result data areas. It also designs a series of heuristic rules for segmenting data records, derived from visual observation of multiple web pages. The extraction and annotation process adopts the Partial Tree Alignment algorithm to align multiple data records produced by the same data source. The main idea of this algorithm is to build a progressively growing seed tree against which multiple trees are aligned.
Each data record can be treated as a seed tree, so the number of seed trees in a data record area depends on the number of data records in that area; eventually a seed tree containing the most nodes is constructed, and this is the tree against which all subtrees from the same data source can be aligned.

(3) For data annotation of query result pages, we first studied an algorithm for identifying duplicate data records, in order to avoid unnecessary annotation. This method combines the advantages of distance-function-based and machine-learning-based methods. Because an ontology carries rich semantic information, this paper uses the ontology to annotate extracted data. The instance information extracted from the data takes the form of label-value pairs, and it is handled in one of two ways depending on the label. If the label is not null, the label and value are mapped to the ontology, and an appropriate label is annotated for that instance. If the label is null, this paper proposes a method of resetting query conditions, which uses the number of data records returned by the Deep Web query to determine how to annotate the instance. This method rests on the observation that a Deep Web back-end server returns as much query result information as possible when more reasonable query conditions are chosen in the query interface. This paper also proposes a K-beam search algorithm based on KBFS for predicting the annotation of data instances. This method has the predictive ability of a model based on maximum information entropy, and it also keeps the KBFS search algorithm's advantage of seeking an optimal path.

(4) To overcome the limits of a static ontology's knowledge representation, this paper proposes a dynamically evolving ontology for data extraction and annotation.
The evolution process can be divided into four stages: capturing change information, representing change information, semantic change, and executing the ontology evolution. In addition, three basic rules for ontology evolution are defined, which ensure that the evolved ontology carries richer information without semantic conflicts.

Although this paper has studied the extraction and annotation of Deep Web page data in depth, some important techniques are not yet mature and need further work; for instance, a set of evaluation standards for ontology evolution is needed to avoid over-expansion of the ontology. Much work therefore remains for further improvement and innovation.
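
As a rough illustration of the beam search idea mentioned under (3), the sketch below assigns one label to each extracted value while keeping only the k best partial label sequences at every step. It is generic beam search, not the KBFS-based algorithm of the dissertation, whose details the abstract does not give. The names `beam_search`, `toy_score`, and `LABELS`, the sample values, and the probabilities are all hypothetical; `toy_score` merely stands in for a trained maximum entropy model.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Beam = Tuple[float, List[str]]  # (cumulative log-score, labels chosen so far)

def beam_search(values: Sequence[str],
                labels: Sequence[str],
                score: Callable[[Optional[str], str, str], float],
                k: int = 2) -> List[Beam]:
    """Return the k best label sequences for `values` under an additive log-score."""
    beams: List[Beam] = [(0.0, [])]
    for value in values:
        candidates: List[Beam] = []
        for total, path in beams:
            prev = path[-1] if path else None
            for label in labels:
                candidates.append((total + score(prev, label, value), path + [label]))
        # Keep only the k highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:k]
    return beams

LABELS = ["title", "price", "year"]  # hypothetical ontology attributes

def toy_score(prev: Optional[str], label: str, value: str) -> float:
    """Hypothetical stand-in for a learned model: log-probability of `label`."""
    if value.startswith("$"):
        guess = "price"
    elif value.isdigit() and len(value) == 4:
        guess = "year"
    else:
        guess = "title"
    p = 0.9 if label == guess else 0.05
    if label == prev:  # assume one record rarely repeats an attribute
        p *= 0.1
    return math.log(p)

values = ["Deep Web Mining", "$12.99", "2011"]  # unlabeled values of one record
best_score, best_path = beam_search(values, LABELS, toy_score, k=2)[0]
print(best_path)
```

A larger k explores more alternative labelings at higher cost; with k equal to 1 the search reduces to a greedy choice at each value.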
Keywords/Search Tags: Deep Web, Data Extraction, Data Annotation, Domain Ontology, Ontology Evolution