The Research On Automatic Classification Of Deep Web Databases

Posted on: 2010-09-29
Degree: Master
Type: Thesis
Country: China
Candidate: D D Zhang
GTID: 2178360272497085
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, more and more back-end online databases have emerged. They provide large amounts of information to users through their query interfaces; this information is not only abundant but also of better quality, and the capacity of these online databases is growing rapidly. Because the information can be accessed only by filling out the query interfaces that these databases provide, and traditional web search engines cannot fill out such interfaces, those search engines cannot reach it. The information stored in these online databases is called Deep Web information, and the back-end online databases themselves are called Deep Web databases, or Web databases. To make effective use of Deep Web information, many researchers are studying the Deep Web. Because the information in a Web database usually belongs to a single domain and covers a single topic, some researchers have proposed organizing and integrating Deep Web information by domain, an idea that most researchers have accepted. Integrating Deep Web information therefore requires classifying Web databases by domain automatically, so this paper studies the automatic classification of Deep Web databases.
Currently, there are two main types of methods for classifying Deep Web databases: post-query and pre-query. Post-query techniques must wait for query results to be returned, which wastes a great deal of time, and when a query interface has many attributes, filling it out becomes very complicated, so good classification results are difficult to obtain. This paper therefore studies the pre-query technique.
Several pre-query methods have been proposed previously. B. He used an unsupervised method: hypothesizing that homogeneous sources are characterized by the same hidden model for their schemas, he clustered the query interfaces of Web databases by the attributes on those interfaces. Qian Peng also proposed clustering Web databases by hierarchical structure. Luciano Barbosa employed a supervised method, using form text to classify the query interface forms of Web databases. This paper puts forward a new classification framework for Deep Web databases based on the idea of Luciano Barbosa. The framework rests on two ideas. (1) It uses the centroid to enhance the weights of the feature vector. Previous research has found that the larger the similarity between a feature vector and a centroid, the more likely the feature vector belongs to the centroid's domain; to improve classification precision, this paper therefore uses the similarity value to enhance the weights of the features that the vector shares with the centroid. (2) Luciano Barbosa used the words in the form text to construct the feature vector, but many words in natural language share the same meaning, and the same holds for Deep Web form text. Following the idea of Andreas Hotho et al., this paper therefore adds semantic information, obtained from WordNet, to the feature vector.
Chapter 3 presents the proposed ideas in detail. (1) First, it introduces the new classification framework and describes its procedure.
(2) Based on the idea of Luciano Barbosa, this paper proposes an algorithm for extracting form text from a query interface form. The algorithm mainly extracts the usable text of the form and the attribute values of form controls that are helpful for classification: the form is first parsed into a DOM model, and the text is then extracted by recursively traversing the nodes of the DOM model. (3) On feature vector construction, this paper mainly introduces the construction of the semantic feature vector: for each word that can be found in WordNet, its synset is extracted and replaces the word in the feature vector, while words that cannot be found in WordNet are kept as they are, so the semantically constructed feature vector is composed of synsets and words. Notably, the stemming step differs from that of traditional text feature vector construction: because words must be looked up in WordNet, each word must be consistent with its form in WordNet, so this paper uses WordNet itself for stemming. (4) On feature vector enhancement, this paper first introduces how to compute the weights of the centroid feature vector, namely by the Ave method, and then introduces how to enhance the form text feature vector by computing the similarity between the feature vector and the centroid; this is a heuristic method. (5) Finally, this paper introduces the models and data structures used for the system implementation, which is written in Java, a cross-platform programming language.
Chapter 4 is the experimental part. The dataset used in this paper is a set of Deep Web query interfaces belonging to six different domains, selected from the UIUC dataset. We extract the forms of the six domains from the HTML pages with a form extraction program that we wrote, and then construct the dataset for training the classifier, which is composed of form text feature vectors.
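The form text extraction described in step (2) could be sketched roughly as follows. This is only an illustration in Python, not the thesis's Java implementation: it assumes the form is well-formed XHTML so that the standard-library `xml.dom.minidom` parser can build the DOM, and the list of attributes in `TEXT_ATTRIBUTES`, the function name `extract_form_text`, and the sample form are all assumptions made for this example.

```python
from xml.dom import minidom

# Attributes whose values may help classification (an assumption of this sketch).
TEXT_ATTRIBUTES = ("name", "value", "alt", "title")


def extract_form_text(node):
    """Recursively collect text nodes and selected attribute values."""
    parts = []
    if node.nodeType == node.TEXT_NODE:
        text = node.data.strip()
        if text:
            parts.append(text)
    elif node.nodeType == node.ELEMENT_NODE:
        for attr in TEXT_ATTRIBUTES:
            value = node.getAttribute(attr).strip()
            if value:
                parts.append(value)
    # Visit the children in document order, depth first.
    for child in node.childNodes:
        parts.extend(extract_form_text(child))
    return parts


# Hypothetical, well-formed sample form used only for illustration.
SAMPLE_FORM = """
<form action="/search">
  <label>Title</label> <input type="text" name="title"/>
  <label>Author</label> <input type="text" name="author"/>
  <select name="format"><option value="hardcover">Hardcover</option></select>
  <input type="submit" value="Find books"/>
</form>
"""

document = minidom.parseString(SAMPLE_FORM.strip())
form_text = extract_form_text(document.documentElement)
print(form_text)
```

Real query interfaces are often not well-formed, so a tolerant HTML parser would be needed in practice; the recursive traversal itself would stay the same.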
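The centroid-based enhancement in step (4) might look something like the sketch below. The abstract does not give the formulas, so several choices here are assumptions: the "Ave method" is taken to mean averaging each feature's weight over the vectors of a domain, similarity is taken to be cosine similarity, the vector is compared against the most similar centroid, and shared features are boosted by the factor `1 + similarity`. The function names and toy vectors are hypothetical.

```python
import math
from collections import Counter


def centroid(vectors):
    """Average each feature's weight over all vectors of one domain (assumed 'Ave' rule)."""
    totals = Counter()
    for vec in vectors:
        totals.update(vec)  # adds the weights feature by feature
    return {term: weight / len(vectors) for term, weight in totals.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def enhance(vec, centroids):
    """Boost the features that vec shares with its most similar centroid.

    The boost rule w * (1 + similarity) is an assumption of this sketch,
    not a formula taken from the thesis.
    """
    best = max(centroids.values(), key=lambda c: cosine(vec, c))
    sim = cosine(vec, best)
    return {t: w * (1 + sim) if t in best else w for t, w in vec.items()}


# Toy training vectors for two hypothetical domains.
centroids = {
    "books": centroid([{"title": 1.0, "author": 1.0}, {"title": 1.0, "isbn": 1.0}]),
    "airfares": centroid([{"from": 1.0, "to": 1.0}, {"from": 1.0, "date": 1.0}]),
}

enhanced = enhance({"title": 1.0, "keyword": 1.0}, centroids)
print(enhanced)
```

In this toy run only `title` is shared with the closest centroid, so only its weight grows, while `keyword` is left unchanged.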
In addition, we introduce four measures for evaluating classifier performance: Accuracy, Precision, Recall, and F-Measure. Finally, we validate the two proposed ideas with four experiments, each using 10-fold cross-validation.
The first experiment follows an existing idea: it uses only the words in the form text to construct the feature vector and runs four classification algorithms. SVM gives better results than the others, so the following three experiments use only SVM; this first experiment serves as the baseline for the other three. The second experiment uses the centroid to enhance the feature vector, which is the first idea proposed in this paper; its results are better than those of the first experiment in most domains of the dataset. The third experiment adds semantic information to the feature vector, which is the second proposed idea; its results are also better than those of the first experiment. The fourth experiment combines the two ideas, and its results are likewise better than those of the first experiment. The proposed ideas can therefore improve classification performance.
Chapter 5 introduces an application of Web database classification. It mainly presents combining a form structure classifier with a form text classifier to identify Deep Web entry points. The main benefits are that the learning task of the domain-specific form text classifier is simplified and that the overall classification process is more accurate and robust. This approach partitions the complicated features of HTML forms into structure features and text features, applies a different classifier to each partition, and organizes the two classifiers in sequence.
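For reference, the four evaluation measures named above (Accuracy, Precision, Recall, and F-Measure) can be computed for one domain treated as the positive class as in the sketch below. The standard definitions are used, with F-Measure taken as the harmonic mean of precision and recall; the function name `evaluate` and the toy labels are hypothetical, and the thesis's own evaluation code is not shown in the abstract.

```python
def evaluate(actual, predicted, positive):
    """Return accuracy, precision, recall and F-measure for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return accuracy, precision, recall, f_measure


# Toy labels for six forms from two hypothetical domains.
actual = ["books", "books", "jobs", "books", "jobs", "jobs"]
predicted = ["books", "jobs", "jobs", "books", "books", "jobs"]

accuracy, precision, recall, f_measure = evaluate(actual, predicted, positive="books")
print(accuracy, precision, recall, f_measure)
```

With 10-fold cross-validation, these values would typically be computed on each held-out fold and then averaged.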
The form structure classifier uses the structure features to eliminate non-searchable forms, and the form text classifier uses the available text on a form to decide whether a searchable form is an entry point of the specific domain. Finally, we run an experiment on the Books domain by combining the classifiers and find that the accuracy is above 96%.
Finally, we summarize this paper. Although both proposed ideas improve the classification results, the improvement is not large. In future work, we will look for a more effective way to improve the classification results and will consider domain ontologies.
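The sequential arrangement of the two classifiers described for Chapter 5 can be illustrated with the minimal sketch below. Only the control flow reflects the text: the structure classifier filters out non-searchable forms first, and the text classifier then decides on domain membership. The two rule-based functions are hypothetical stand-ins for the learned classifiers, and the form representation (a dict with `controls` and `text`) is an assumption made for this example.

```python
def classify_form(form, is_searchable, in_domain):
    """Apply the structure classifier first, then the text classifier."""
    if not is_searchable(form):
        return "non-searchable"
    return "domain entry" if in_domain(form) else "other searchable form"


def toy_structure_classifier(form):
    """Hypothetical stand-in: searchable if it has a text box and no password field."""
    controls = form["controls"]
    return "text" in controls and "password" not in controls


BOOK_TERMS = {"title", "author", "isbn", "publisher"}


def toy_text_classifier(form):
    """Hypothetical stand-in: Books domain if the form text mentions a book term."""
    return bool(BOOK_TERMS & set(form["text"].lower().split()))


forms = [
    {"controls": ["text", "password", "submit"], "text": "User name Password"},
    {"controls": ["text", "select", "submit"], "text": "Title Author Format"},
    {"controls": ["text", "submit"], "text": "Departure city Arrival city"},
]

results = [
    classify_form(f, toy_structure_classifier, toy_text_classifier) for f in forms
]
print(results)
```

Because the login form is rejected in the first stage, the text classifier only ever sees searchable forms, which is the simplification of its learning task that the text describes.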
Keywords/Search Tags: Deep Web, Web databases, query interface, centroid, WordNet