A Dissertation Submitted To Graduate School Of Southwest University

Posted on: 2012-01-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C M Wu
Full Text: PDF
GTID: 1118330368990186
Subject: Agricultural Resources and Environment
Abstract/Summary:
With the rapid development of the Internet, a large amount of web resources has been stored in Web databases and has gradually moved deeper, forming a mass of Deep Web data. Because of the special way in which these data are provided and accessed, the Deep Web cannot be indexed effectively by traditional text search engines, which makes it difficult for users to obtain and use the data. Deep Web data integration is a research issue that arose against this background.
The construction of a global query interface is an important part of Deep Web data integration and involves several key technologies. Although many solutions have been proposed, the research is in general still at an exploratory stage, and many problems remain to be solved. For example, in Deep Web entry identification, the widely used rule-based method lacks dynamic adaptability, and there is still no effective way to distinguish simple query interfaces from general search engines. In text-based Deep Web classification, there are still no scientific, quantitative criteria for feature selection; because the different roles that features play in domain classification are not fully considered when weights are calculated, the vector model of the query interface is not accurate enough, which reduces classification accuracy. In query interface schema extraction, current work does not make full use of visual layout information. In schema matching among interfaces, the meta-information of properties is not fully used, and matching precision needs further improvement. More innovative study is therefore needed in both research ideas and research methods.
Several key technologies involved in building the global query interface were studied in this dissertation.
Building on a review of domestic and international work, we analyzed each issue thoroughly and proposed solutions or improvements for the defects of existing approaches. Theoretical analysis and a set of experiments show that the proposed approaches are accurate and feasible and have practical value. Finally, an experimental prototype system, "comprehensive soil information search", was built using soil data as an example to explore the application of data integration technologies in the agriculture domain.
(1) Research on automatic identification of Deep Web entries
Identification of Deep Web entries is the foundation of Deep Web data integration. Unlike previous rule-based methods, we adopt a machine-learning approach and propose identifying Deep Web entries with a neural network. First, the influencing factors, including form controls, control properties, property values and certain keywords, were considered comprehensively, and an identification model was built. Second, features were selected on the basis of statistical results to distinguish query interfaces from non-query interfaces. Finally, the self-learning mechanism of the neural network was used to adjust the parameters of the influencing factors automatically, which avoids the subjectivity and lack of dynamic adaptability of the traditional rule-based method. The experimental results show that the proposed approach achieves high accuracy and works well in distinguishing simple query interfaces from general-purpose search engines.
(2) Research on domain classification of Deep Web
Domain classification of the Deep Web enables more effective organization and management of web resources.
Drawing on traditional text classification algorithms and taking the characteristics of query interfaces into account, we proposed an approach to classifying the Deep Web based on domain text. The major work includes: 1) A semantic abstraction method using ontology knowledge was proposed, in which concepts that express the same semantics are first extracted from different texts. This enhances the representational ability of the domain feature text and at the same time achieves effective dimension reduction; 2) Domain correlation is defined as the quantitative criterion for selecting domain feature text, which avoids the subjectivity and uncertainty of manual selection; 3) In constructing the interface vector, an improved weighting method named W-TFIDF is proposed to evaluate the different roles of domain feature text. The experimental results show that the feature text selected by our approach is accurate and effective, and that the new weighting method, W-TFIDF, not only gives better classification precision than TF and TFIDF but also remains more stable in the KNN classification algorithm.
(3) Research on form element and label matching
Form element and label matching is an important part of query interface understanding. A vision-based element-label matching approach was proposed in this dissertation. The major work involves the following three aspects: 1) A form representation technology named Table-Based Interface Expression (TBIExp) was given, with which the visual layout of a query interface can be restored automatically by analyzing its HTML source code; 2) The position relationships and visual features were summarized comprehensively on the basis of statistics and observation, and a complete set of heuristic rules was formed; 3) A 3-Round Label Extraction (R3LEX) algorithm was proposed, which comprehensively considers the <label> tag, text semantics and position features.
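The abstract does not spell out the three rounds of R3LEX, so the following Python sketch is only a rough illustration of multi-round element-label matching, not the dissertation's algorithm. It assumes three simplified rounds: an explicit `<label for="...">` lookup, then the nearest preceding text as a crude position cue, then the field's `name` attribute as a last resort. The class `FormLabelParser`, the function `extract_labels` and the sample form are hypothetical names introduced here, and TBIExp's layout reconstruction is not modeled.

```python
from html.parser import HTMLParser


class FormLabelParser(HTMLParser):
    """Collects form fields, explicit <label for=...> text and loose text in document order."""

    FIELD_TAGS = {"input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.explicit = {}      # field id -> text of the <label for="id"> element
        self.fields = []        # (field name, field id, nearest preceding loose text)
        self._label_for = None  # id targeted by the <label> currently open, if any
        self._last_text = ""    # most recent non-empty text seen outside such a label

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._label_for = attrs.get("for")
        elif tag in self.FIELD_TAGS and attrs.get("type") not in ("hidden", "submit"):
            self.fields.append((attrs.get("name"), attrs.get("id"), self._last_text))
            self._last_text = ""  # a piece of text is consumed by at most one field

    def handle_endtag(self, tag):
        if tag == "label":
            self._label_for = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._label_for is not None:
            self.explicit[self._label_for] = text
        else:
            self._last_text = text


def extract_labels(html):
    """Map each field name to a label using three simplified rounds (an assumption)."""
    parser = FormLabelParser()
    parser.feed(html)
    labels = {}
    for name, field_id, nearby in parser.fields:
        if field_id in parser.explicit:   # round 1: explicit <label for="id">
            labels[name] = parser.explicit[field_id]
        elif nearby:                      # round 2: nearest preceding text
            labels[name] = nearby.rstrip(":")
        else:                             # round 3: fall back to the field name
            labels[name] = name
    return labels


SAMPLE_FORM = """
<form>
<label for="t">Title</label><input id="t" name="title">
Author: <input name="author">
<input name="year">
</form>
"""

if __name__ == "__main__":
    print(extract_labels(SAMPLE_FORM))
```

A real implementation would replace the "nearest preceding text" round with rules over the reconstructed visual layout, since text that precedes a field in the HTML source is not always the text displayed next to it.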
Experiments confirm the effectiveness of our method.
(4) Research on schema extraction of Deep Web query interface
Schema extraction of the query interface is the foundation for many subsequent tasks, including schema matching across the properties of multiple interfaces, unified interface generation and so on. In this dissertation, the relationships among interface properties were viewed as a tree structure, and a bottom-up hierarchical clustering approach was proposed based on the TBIExp form reconstruction method. The main work includes: 1) The factors affecting property grouping were fully examined and seven attribute grouping modes were summarized. These grouping modes were ordered according to their grouping areas, which provides the basis for the subsequent clustering algorithm; 2) BUCluster, a tree-based hierarchy construction algorithm, was proposed based on TBIExp. Because the visual layout information of a query interface is encoded in TBIExp, our method is more intuitive and accurate than previous work; 3) AttrLEX, a heuristic-based property label extraction algorithm, was proposed based on the schema tree of the query interface. Experiments show that the extraction accuracy of our approach is improved compared with previous work.
(5) Research on schema matching across interface properties
Schema matching is a fundamental and key technology in Deep Web data integration. Focusing on schema matching across properties coming from multiple query interfaces, we proposed effective solutions to the problems of 1:1 simple matching and 1:m complex matching.
The major work includes: 1) An approach to standardizing the interface text based on a domain ontology was proposed, which makes the calculation of text semantic similarity more scientific and accurate; 2) In the similarity evaluation, we take full account of the similarities between the meta-information of properties, such as semantic similarity, domain similarity and value similarity, which avoids the shortcoming of relying solely on semantic similarity in traditional schema matching; 3) Quantitative criteria for similarity calculation were given; 4) A neural network-based schema matching model was built. Because a neural network handles the non-linear relationship between input and output well, our method avoids the subjectivity and uncertainty caused by manual specification of parameters; 5) In 1:m matching, because the schema tree of the query interface is used, our method is simple and intuitive and also obtains satisfactory matching accuracy. Experiments demonstrate the feasibility and effectiveness of the proposed approach.
(6) Research on the applications of integration technology in agriculture domain
Using the soil data of Jiangjin from the second national soil survey as an example, an experimental prototype system, "comprehensive soil information search", was built. Given a query condition, the system can access multiple Web databases at the same time. Through the system, the rationality of the proposed approaches was examined comprehensively, and the applications of data integration technologies in the agricultural domain were discussed.
In summary, several key technologies involved in the construction of a Deep Web integrated query interface were studied in this dissertation.
To address the shortcomings of rule-based identification of Deep Web entries, a machine-learning approach using a neural network was proposed, which avoids the lack of dynamic adaptability of the rule-based method and can effectively distinguish simple query interfaces from search engines. To address the problems in text-based Deep Web classification, a quantitative criterion for domain feature text selection was defined, which avoids the subjectivity and uncertainty of manual selection; in addition, an improved weighting method for feature text was given, which makes the vector of the query interface more scientific and accurate and clearly improves classification precision. To address the defect that visual layout information was not fully used in schema extraction, a vision-based label extraction and matching method was proposed and a table-based form representation technology was given, which improved the accuracy of schema extraction and schema matching. To address the lack of applications of Deep Web integration technologies in the agriculture domain, a "comprehensive soil information search" experimental prototype system was built, a useful exploration of acquiring agricultural information resources from the perspective of information technology.
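The abstract names the improved weighting method W-TFIDF but does not give its formula. As a hedged illustration of the general idea of weighting feature text by its role in a domain, the Python sketch below multiplies a standard smoothed TF-IDF score by a per-term factor. The `domain_weight` mapping, the function names and the toy token lists are assumptions made for this example and are not taken from the dissertation.

```python
import math
from collections import Counter


def weighted_tfidf(docs, domain_weight):
    """Return one {term: weight} vector per document.

    docs: list of token lists, one per query interface.
    domain_weight: hypothetical per-term multiplier standing in for a
    domain-dependent factor; terms not listed get a factor of 1.0.
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # term frequency * smoothed inverse document frequency * domain factor
            term: (count / len(doc))
            * math.log(n / df[term] + 1.0)
            * domain_weight.get(term, 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Toy interfaces: two from a book domain, one from a car domain.
DOCS = [
    ["title", "author", "isbn"],
    ["title", "author", "publisher"],
    ["make", "model", "price"],
]
# Assumed factor: "isbn" is treated as strongly indicative of its domain.
WEIGHTS = {"isbn": 2.0}

if __name__ == "__main__":
    vecs = weighted_tfidf(DOCS, WEIGHTS)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A nearest-neighbor classifier such as KNN could then assign a new interface to the domain of the labeled interfaces whose vectors are most similar to its own.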
Keywords/Search Tags: Deep Web, data integration, schema extraction, schema matching, agricultural applications