Study On Sampling Technology Of Web Database

Posted on: 2014-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: J Qiao
Full Text: PDF
GTID: 2308330473953793
Subject: Computer application technology
Abstract/Summary:
With its rapid development, the Web has become a vast and complex data source. The Web can be divided into two parts, the Surface Web and the Deep Web. Acquiring information from the Deep Web requires accessing back-end Web databases through their query interfaces, and is therefore limited by the capabilities of those interfaces. To use Deep Web resources efficiently, researchers integrate Deep Web data and build Deep Web data integration systems. The large number of databases in the Deep Web makes this difficult, so a Deep Web data integration system needs to understand each database's subject distribution, update frequency, size, and other useful characteristics. In practice, a Web database contains a huge amount of information, which makes extracting all of its data difficult. Web database sampling technology emerged in response: it extracts from a Web database a set of records that can represent the database as a whole.
Existing Web database sampling methods have many shortcomings, mainly in two respects: high sampling cost and poor sample quality. During sampling, the hit rate is low, and the records returned by each query have a high repetition rate, which leads to a high sampling cost. Sample quality is poor, and the distribution of the sample database deviates from that of the Web database, so the sample database cannot be used in place of the Web database for analysis.
This thesis proposes a new Web database sampling model that obtains a better sample, one that can be used for analysis in place of the Web database. The model introduces two new techniques: a query model based on attribute correlation, and a query condition generation strategy based on word frequency and attribute value relevance. Together, these techniques both reduce the sampling cost and improve the sample quality.
After sampling, the resulting samples represent the features of the Web database.
The query model based on attribute correlation selects two attributes from the query interface: one is the classification attribute, and the other is the query attribute least correlated with the classification attribute. A query condition may contain one or more attribute values, which may belong to the same attribute or to different attributes. The query model limits the number of attributes and attribute values in a query condition, and during sampling each query condition is generated according to the definition of the query model. Compared with traditional methods, the Web database sampling model proposed in this thesis has a lower sampling cost for the same number of sample records.
The query condition generation strategy analyzes the current sample database using word frequency and attribute value relevance in order to generate query conditions that conform to the query model. Word frequency reflects the state of the art and the development trend of the field to which the Web database belongs, and is very meaningful for understanding Web databases; it therefore helps improve the quality of the sample database so that the sample can represent the Web database. Attribute value relevance is the frequency with which two different attribute values appear in the same record; it is used to reduce the number of useless queries and thus the sampling cost.
Both the query model based on attribute correlation and the query condition generation strategy based on word frequency and attribute value relevance contribute to the sampling model. Sampling has two important concerns: sampling cost and sample quality.
Improvements on both points have been verified by experiment, and the results show that the new Web database sampling model in this thesis obtains a better sample that can be used in place of the Web database.
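As a rough illustration of the two statistics the abstract defines, the following Python sketch computes word frequency and attribute value relevance over a tiny sample database. The records, attribute names, and function names are all hypothetical and chosen only for illustration; the abstract does not say how the thesis combines these statistics into query conditions, so that step is not shown.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample database: each record maps attribute name -> value.
sample = [
    {"category": "database", "language": "english", "title": "web database sampling"},
    {"category": "database", "language": "english", "title": "deep web query interface"},
    {"category": "network", "language": "chinese", "title": "web traffic sampling"},
]


def word_frequency(records, attribute):
    """Count how often each word occurs in one text attribute of the sample."""
    counts = Counter()
    for record in records:
        counts.update(record.get(attribute, "").lower().split())
    return counts


def value_relevance(records):
    """Count how often two values of different attributes share a record."""
    counts = Counter()
    for record in records:
        # Sort so each pair is always stored under the same key order.
        pairs = sorted(record.items())
        for left, right in combinations(pairs, 2):
            counts[(left, right)] += 1
    return counts


freq = word_frequency(sample, "title")
rel = value_relevance(sample)
```

Under these assumptions, a value pair whose count in `rel` is zero has never co-occurred in the sample so far, which is the kind of signal the abstract describes using to avoid useless queries.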
Keywords/Search Tags:Web database, sampling model, word frequency, attribute value relevance, RF-Sampler