
Extract SCHEMA From Web Query Interfaces

Posted on: 2008-09-10    Degree: Master    Type: Thesis
Country: China    Candidate: Z T He    Full Text: PDF
GTID: 2178360212496018    Subject: Computer software and theory
Abstract/Summary:
Nowadays, people depend more and more on search engines when looking for information on the Web. Current search engines only follow hypertext links and ignore the HTML forms contained in many websites. These forms are backed by web databases, which provide a lot of useful information for users. The set of Web pages that can be dynamically generated in response to form-based user queries is referred to as the Deep Web [BrightPlanet] or Hidden Web [DanielaFlorescu]. The number of such dynamically generated pages on the Deep Web is estimated to be around 500 times the number of static pages on the surface Web. Query interfaces are described in natural language for human users, but they are hard for machines to understand. In order to fill in query interfaces and integrate the results returned by the databases automatically, a series of problems must be solved: extracting schema from query interfaces, schema matching, translating queries between different forms, and integrating the results.

Many researchers have focused on these problems, trying to develop a uniform query interface for users. When a user fills in the uniform query interface, the computer translates the query to several local interfaces, accesses the local web databases, retrieves data from them, and then understands and integrates the results. Finally, the integrated data are displayed to the user.

There are a number of challenges in automating the process of integrating data from multiple Web databases. The first challenge is to semantically understand the search forms embedded in query interfaces. Search forms are written in HTML and are designed for human use. To make them machine understandable, formal representations need to be extracted from them. This enables queries to be translated and filled into local query interfaces correctly.

In this paper, we focus on this first problem: how to semantically understand search forms. Several approaches have been proposed. One of the most important is that of Zhen Zhang et al. They identified the most popular layout models used by HTML designers based on the positions of labels and elements, and viewed a query interface as a visual language whose composition conforms to a hidden grammar; semantic understanding thus becomes a parsing problem. Another important method is that of Hai He et al. They observed that a substantial amount of semantic information is "hidden" on a query interface; for example, "Publication date" implies that the attribute semantically has a date value type. Based on this observation, they proposed a list of heuristic rules involving domain type, value type and the relationships among elements, and used position information together with these rules to understand query interfaces.

Most of the proposed methods do not consider semantic information; they make use only of position information to understand query interfaces, so labels and elements are often confused because of their positions. In this paper we make use of the semantic information in the texts of the query interfaces. We use WordNet, one of the most popular lexical dictionaries, to understand the words that appear in the labels and elements.
WordNet is a large lexical resource that combines features of dictionaries and thesauruses in a unique way, allowing a fresh perspective on the semantics of nouns, verbs, and adjectives and offering new possibilities for exploring the internal structure of the lexicon. WordNet is a widely used semantic network organized into synsets, with words as its most elementary units. Although WordNet is organized by synsets, it is not easy to enumerate them. Synsets are connected by relationships such as hyponymy and meronymy. WordNet is particularly well suited for similarity measures, since it organizes nouns and verbs into hierarchies of is-a relations, so we use it to compute the similarity of the words appearing in query interfaces.

We propose a new approach based on both the position information and the semantic information of query interfaces:

First, noticing that the layout distance is not the simple linear distance used in existing work, we propose a tree structure to describe the structure of a query interface, based on the DOM tree of the HTML file.

Second, we define two position measurements on this tree structure: depth distance and branch distance (see the sketch below).

Third, we use the Edge Counting Method to measure the semantic similarity of two words. This method uses the path length between concepts: it regards WordNet as a graph and measures the similarity between two concepts by finding the minimum number of edges linking them, i.e., the shorter the path from one node to another, the more similar they are. Based on this word-level similarity, we propose a new measure of the semantic similarity of two sentences (also sketched below).

We then develop an approach to extract an attribute tree (SCHEMA) in which both the layout of labels and form elements and the semantic similarity between labels and elements are taken into account. Finally, we tested 114 query interfaces chosen from seven domains that are frequently used in web databases. In this experiment, the approach based on both semantic and layout information achieved much higher accuracy, and by analyzing the results we identified a possible further improvement.
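The abstract names a depth distance and a branch distance over the DOM-based tree but does not give their formal definitions. The Python sketch below shows one plausible reading (depth distance as the difference of the two nodes' depths; branch distance as the number of edges from the two nodes up to their lowest common ancestor). Both definitions, and the tiny example tree, are assumptions made for illustration, not the thesis's exact formulas.

```python
# Minimal sketch under assumed definitions of "depth distance" and "branch distance"
# between two nodes (e.g. a label and a form element) of a DOM-like layout tree.

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def ancestors(self):
        """Path from this node up to the root, inclusive."""
        path, cur = [], self
        while cur is not None:
            path.append(cur)
            cur = cur.parent
        return path


def depth(node):
    return len(node.ancestors()) - 1


def depth_distance(a, b):
    """Assumed: absolute difference of the two nodes' depths."""
    return abs(depth(a) - depth(b))


def branch_distance(a, b):
    """Assumed: edges from a and b up to their lowest common ancestor."""
    anc_a = a.ancestors()
    anc_b = set(b.ancestors())
    lca = next(n for n in anc_a if n in anc_b)
    return (depth(a) - depth(lca)) + (depth(b) - depth(lca))


# Tiny hypothetical layout tree: form > tr > (td > label, td > select)
form = Node("form")
row = Node("tr", form)
label = Node("label:Publication date", Node("td", row))
select = Node("select:year", Node("td", row))
print(depth_distance(label, select), branch_distance(label, select))  # -> 0 4
```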
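For the Edge Counting step, word similarity is based on the shortest path between concepts in WordNet. The sketch below uses NLTK's WordNet interface, whose path_similarity is one standard path-length-based measure; it is not necessarily the exact formula used in the thesis. The best-match averaging used here for the sentence (label text) similarity is likewise an illustrative assumption, since the abstract does not spell out that formula.

```python
# Illustrative sketch: WordNet path-length ("edge counting") word similarity and a
# simple best-match sentence similarity for label texts.
from nltk.corpus import wordnet as wn


def word_similarity(w1, w2):
    """Path-length-based similarity between two words (max over noun synsets)."""
    best = 0.0
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        for s2 in wn.synsets(w2, pos=wn.NOUN):
            sim = s1.path_similarity(s2)  # 1 / (1 + shortest path length)
            if sim is not None and sim > best:
                best = sim
    return best


def sentence_similarity(sent1, sent2):
    """Assumed scheme: average best-match word similarity from sent1 to sent2."""
    words1, words2 = sent1.lower().split(), sent2.lower().split()
    if not words1 or not words2:
        return 0.0
    score = sum(max(word_similarity(a, b) for b in words2) for a in words1)
    return score / len(words1)


print(sentence_similarity("publication date", "release date"))
```

Taking the maximum over synset pairs is a common way to handle word sense ambiguity when only the label text is available; a full implementation would also need tokenization and stop-word handling for longer labels.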
Keywords/Search Tags: Interfaces