Research Of Data Source Selection With Similar Theme In Deep Web Integrated System

Posted on: 2012-07-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y B Sang
GTID: 2178330338997359
Subject: Computer software and theory
Abstract/Summary:
Deep Web integrated systems are an important way for people to obtain the high-quality data hidden in online databases. To answer a query, such a system must retrieve information from a large number of data sources on the web, so the cost of obtaining information grows with the number of sources. The quality of these sources is also uneven, which makes it hard for users to obtain high-quality data efficiently.

This paper presents a method for selecting Deep Web data sources with similar themes, based on a study of using data source quality indicators to select high-quality data sources. The method computes the content repeatability between a new data source and the integrated system through differential analysis of the data source, and it assesses data source quality from several perspectives using four indicators that characterize a database: accuracy, sequence, data source size, and authority.

The main contents of this paper can be summarized as follows:

① It discusses and analyzes research on Deep Web technology, the state of research in China and abroad, its practical significance, and the domain knowledge and related technologies of the Deep Web integration framework.

② It obtains the repeatability of three or more data sources with similar themes by using an improved data source estimation method. First, a set of key attributes is chosen for the records of a data source. Next, the edit distance method is used to compare the values of corresponding attributes between the records of a single data source and the records of similar data sources in the integrated system. Finally, the FR (Frequent Records) method is used to obtain the content repeatability between the single data source and the set of data sources in the integrated system.

③ It improves the method for judging the relevance of records in a query result set. Through probing queries, the method obtains the frequency with which a record appears across the set of data sources with similar themes; a record is considered relevant to the query when its frequency exceeds a given threshold. Varying the threshold yields different numbers of relevant records, and the method removes the attribute-type restrictions that the query interface imposes on traditional relevance judgment methods.

④ Because current Deep Web data source quality assessment suffers from poor objectivity and low accuracy, the paper establishes a data source quality assessment model that uses the accuracy, sequence, data source size, and authority indicators to evaluate Deep Web data sources, so that the N highest-quality Deep Web data sources can be selected for a user query.

Experimental results on mainstream book sites show that the proposed method reduces the burden on the system and effectively assesses the quality of data sources with the same theme, so that the system can obtain higher-quality data sources.
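
The record comparison step in ② can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the normalized similarity, the 0.8 threshold, the function names, and the sample book records are all assumptions made for illustration, and the FR (Frequent Records) step is not reproduced. The sketch only shows how edit distance over a chosen set of key attributes might be used to estimate what fraction of a new source's records already appear in the integrated system.

```python
from typing import Dict, List, Sequence

Record = Dict[str, str]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def same_record(r1: Record, r2: Record, key_attrs: Sequence[str],
                threshold: float = 0.8) -> bool:
    """Assumed rule: two records match if every key attribute is similar enough."""
    return all(
        similarity(r1.get(k, "").lower(), r2.get(k, "").lower()) >= threshold
        for k in key_attrs
    )


def repeatability(new_source: List[Record], integrated: List[Record],
                  key_attrs: Sequence[str], threshold: float = 0.8) -> float:
    """Fraction of the new source's records that match a record already integrated."""
    if not new_source:
        return 0.0
    duplicates = sum(
        1 for r in new_source
        if any(same_record(r, s, key_attrs, threshold) for s in integrated)
    )
    return duplicates / len(new_source)


# Hypothetical book records, loosely modelled on the book-site setting.
integrated = [{"title": "Data Mining Concepts", "author": "Han"}]
new_source = [
    {"title": "Data Mining Concept", "author": "Han"},       # near-duplicate
    {"title": "Operating Systems", "author": "Tanenbaum"},   # new content
]
print(repeatability(new_source, integrated, ["title", "author"]))  # prints 0.5
```

A source whose repeatability is high adds little new content to the integrated system, which is one plausible reason to rank it lower during selection. How the thesis combines this estimate with the quality indicators is not stated in the abstract.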
Keywords/Search Tags:Repeatability Estimation, Quality Indicators, Deep Web, Data Source Selection, Quality Evaluation Model