Research Of Data Source Selection Of Non-cooperative Structured Deep Web

Posted on: 2014-06-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Deng
Full Text: PDF
GTID: 1488304316458914
Subject: Information management and information systems
Abstract/Summary:
With the constant expansion of the Web, it is very difficult for users to find and query exactly the Web data sources they need. Web data integration systems emerged to provide efficient access to these sources. The Deep Web is the collection of resources that cannot be reached by following hyperlinks. It makes up the dominant share of the Web, and how to integrate and retrieve Deep Web data effectively has become a frontier issue in recent years, one that has long concerned researchers in both the information retrieval and database fields. Deep Web data integration has the following characteristics: the data sources are numerous and autonomous, and the data is dynamic and irregular. These features pose new challenges to the effective use of Deep Web data.

Each field has many accessible data sources with different interfaces, so an integrated retrieval system needs to integrate all of their query interfaces. Once a unified interface exists, it is clearly infeasible to take each user query submitted on it, apply a simple conversion, and forward it to every data source: doing so not only makes queries expensive but also makes it hard to guarantee the quality of the results. For these reasons, data source selection becomes a key issue in Deep Web data integration. Its purpose is to obtain retrieval results that meet users' requirements by querying only a very small number of data sources.

Deep Web data sources fall into two types: text data sources, and structured and semi-structured data sources. Generally speaking, the former can be viewed as a document collection made up of many Web pages, while the latter mainly store real-world entities with many attributes; semi-structured data sources in particular mainly store XML data. Much current research on data source selection addresses these two types.
Research on the former mainly brings mature information retrieval techniques into the selection of text data sources and judges the usefulness of a data source based on term and document ranking. Research on the latter mainly evaluates data sources by mining structural feature information from their content.

Because research on text data source selection started earlier, it has produced many promising results. In recent years, with the rapid development of the commercial Deep Web, more and more attention has been paid to the selection of structured and semi-structured Deep Web data sources. In general, this research is still in its infancy, and many issues remain to be resolved:
(1) Selecting data sources by relevance alone, without considering their quality, easily places a heavy burden on later data integration steps such as entity recognition and data fusion.
(2) The best existing results on structured and semi-structured Deep Web data source selection assume that data sources are cooperative, that is, that they provide users with their index structures and all of their data so that a summary can be built easily. In practice this assumption rarely holds. Further research is therefore needed on how to capture thematic semantic information from sample data in order to build a data source summary that better satisfies query demands. Thematic semantic information includes the relationships between subject headings, between a subject heading and its sub-subject headings, and between a subject heading and its feature words.
(3) Deep Web data sources are updated over time, and after a data source is updated its summary needs to be adjusted accordingly.
However, existing studies have not addressed the dynamic updating of summaries.
(4) Users may submit hybrid queries that contain both search-type keywords and constraint-type keywords. Search-type keywords reflect the user's primary query intent, while constraint-type keywords express constraints on that intent and are commonly given as discrete values. The summaries used by existing methods for structured and semi-structured Deep Web data source selection do not take such queries into account.

Because structured Deep Web data sources are now widely used, this thesis focuses on the four aspects above for structured Deep Web data source selection. The specific research is as follows:
(1) Evaluation of data source quality. The key to evaluating data source quality is to establish a suitable evaluation model. First, from users' feedback we obtain a set of recommended data sources and a set of refused ones. Second, we analyze and calculate the scores of the two sets on objective quality dimensions, and design a core-dimension quality model of data sources according to the degree of discrimination and the degree of overlap. Third, we establish the quality model by SVM training. Finally, we evaluate the method's performance on multi-domain data.
(2) Data source selection for search-type keyword queries. First, we obtain representative sample data using an unbiased sampling method based on backtracking drill; we design a scheme for extracting the subject headings of a data source's sample data based on part of speech, word frequency, position information, and coverage; we obtain the feature words of each subject heading based on subject semantic information; and, centering on users' needs in data source selection for search-type keyword queries, we use the relationships between pairs of subject headings and between subject headings and feature words to build a summary that addresses the data source selection problem.
Second, we propose a subject space selection method and a data source evaluation strategy based on this summary. Finally, based on the update correlation of subject headings among the data sources in a field, and combined with sampling techniques, we design a sample-based dynamic summary update algorithm.
(3) Data source selection for mixed-type keyword queries. Starting from the summary built for search-type keyword queries, we support data source selection for mixed-type keyword queries by adding information about the discrete values of the feature words' constraint attributes to that summary. Our method effectively summarizes attributes of all types by creating histograms for the discrete values of constraint attributes and recording the associations between subject headings and feature words as well as the associations between the record-distribution histograms. In addition, in light of the characteristics of these histogram associations, we give a method for calculating the constraint correlation score between histograms and provide a data source evaluation strategy based on the mixed summary.

The innovations of this thesis are mainly reflected in the following aspects:
(1) Taking user feedback as an important means, we propose a field-oriented method for selecting high-quality data sources. Existing quality-based data source selection methods usually choose a uniform set of quality dimensions according to the researcher's experience, so their selection accuracy differs considerably across fields. From the characteristic data of the refused and recommended data source sets obtained through user feedback, we derive the users' recommendation credibility and the number of recommendations for each data source. With this information, we accurately determine the members of the refused set and the members of the recommended set.
By introducing an overlap degree and a difference degree to analyze the dimensional features of the refused and recommended data source sets, we build an evaluation model of dimension importance, so that different core quality dimensions can be selected dynamically for the data sources in each field. On this basis, an appropriate quality evaluation model of data sources can be established.
(2) We build a subject-semantics-based hierarchical summary of non-cooperative structured Deep Web data sources and present a sampling-based dynamic update method for the summary. Taking full account of subject semantic information, namely the relationships between subject headings, between subject headings and feature words, and between subject headings and sub-subject headings, we construct a hierarchical data source summary. This summary not only characterizes the contents of a data source effectively but also reflects the query semantics of multi-keyword combinations. We then give a data source selection strategy for search-type keyword queries based on this summary. In addition, we design a method for calculating the change rate of a subject space, which can effectively find updated subject headings and accurately measure the degree of variation of a subject space. On this basis, we propose, for the first time, a sampling-based dynamic summary update method.
(3) A mixed summary based on multi-type attributes meets users' needs for mixed-type keyword queries. The mixed summary is built by establishing the associations between subject headings, the associations between subject headings and feature words, and the constraint associations between the histograms of every two feature words under the same constraint attribute. The mixed summary can characterize multi-type attributes efficiently.
Finally, we give a data source selection strategy for mixed-type keyword queries, based on the degree to which a data source matches the search-type keywords of the user query and the degree to which it satisfies the query's constraint conditions.
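
To make the mixed-type selection strategy concrete, the following is a minimal illustrative sketch in Python, not the thesis's actual algorithm. It assumes a simplified summary made of subject headings with feature words plus one histogram per constraint attribute, and it combines the two degrees with a hypothetical weight `alpha`. The names `SourceSummary`, `search_score`, `constraint_score`, and `rank_sources`, as well as both scoring formulas, are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SourceSummary:
    """Hypothetical summary of one data source, built from sampled records."""
    name: str
    # subject heading -> {feature word -> count of sampled records with both}
    subjects: Dict[str, Dict[str, int]] = field(default_factory=dict)
    # constraint attribute -> {discrete value -> count of sampled records}
    histograms: Dict[str, Dict[str, int]] = field(default_factory=dict)


def search_score(summary: SourceSummary, keywords: List[str]) -> float:
    """Assumed measure: share of search-type keywords found in the summary."""
    if not keywords:
        return 0.0
    vocabulary = set(summary.subjects)
    for words in summary.subjects.values():
        vocabulary.update(words)
    return sum(1 for k in keywords if k in vocabulary) / len(keywords)


def constraint_score(summary: SourceSummary, constraints: Dict[str, str]) -> float:
    """Assumed measure: mean share of sampled records matching each constraint."""
    if not constraints:
        return 1.0
    shares = []
    for attribute, value in constraints.items():
        histogram = summary.histograms.get(attribute, {})
        total = sum(histogram.values())
        shares.append(histogram.get(value, 0) / total if total else 0.0)
    return sum(shares) / len(shares)


def rank_sources(summaries, keywords, constraints, alpha=0.5, top_k=1):
    """Rank sources by a weighted sum of the two degrees (alpha is hypothetical)."""
    scored = [
        (alpha * search_score(s, keywords)
         + (1 - alpha) * constraint_score(s, constraints), s.name)
        for s in summaries
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]
```

Under these assumptions, a query with search-type keywords such as "database" and "index" and a constraint such as format = "paperback" favors a source whose summary contains both keywords and whose format histogram is dominated by "paperback", so only the top-ranked sources need to be queried.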
Keywords/Search Tags:Deep Web, Data Source Selection, User Feedback, Subject Semantics, Non-cooperation, Structured