
The Web Hidden Database Extraction And Consistency

Posted on: 2017-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y Sun
Full Text: PDF
GTID: 2348330488496348
Subject: Software engineering
Abstract/Summary:
Existing search engine technology crawls only part of the data on surface Web pages, which it reaches by following hyperlinks. More and more organizations now let public users access a backend database through a query interface on a Web page, and existing search engine technology has difficulty crawling the data in such a backend database effectively. A backend database of this kind is often referred to as a Web Hidden Database. The purpose of crawling a hidden database is to analyse, integrate and mine the acquired data and to provide related value-added services. Since most Web query interfaces limit the number of records returned per query, the main problem in crawling a Web Hidden Database is how to retrieve every record from the backend database, and have all of them returned on result pages, at the cost of as few page queries as possible. This problem has become one of the hotspots in the Web mining area.

In this paper, on the basis of existing Web data extraction methods and the features of the Web Hidden Database, we analyse extraction algorithms for the Web Hidden Database and the sub-block strategy of entity matching. The main algorithms are studied in depth, and each is analysed and evaluated experimentally. The main research contents and work of this paper are as follows:

1. A Web hidden data extraction model is introduced. Based on this model, we study the extraction problem in depth for numerical attributes, classification attributes and hybrid attributes respectively. The model has good scalability.

2. For the problem of extracting numerical attributes in a Web Hidden Database: first, since the query cost of the traditional binary partition algorithm is difficult to estimate, we propose an improved sorting partition algorithm. Secondly, we analyse and verify the algorithm from one dimension to multiple dimensions.

3. For the problem of extracting classification attributes: on the basis of a query decomposition tree built on the classification attributes, and using depth-first search, we propose a heuristic slice-cover algorithm, which lowers the query cost and improves efficiency.

4. On the basis of the heuristic slice-cover algorithm for classification attributes and the sorting partition algorithm for numerical attributes, a hybrid extraction algorithm is proposed for Web Hidden Databases that contain both numerical and classification attributes. This further reduces the query cost of extracting a Web Hidden Database.

5. To improve the consistency of the data, for the problem of repeated entities in the database, we combine the attribute predicate with the matching function to obtain a better matching strategy for repeated entities.

In summary, this paper studies how to obtain all the data in a hidden database at the cost of few queries, together with the consistency of the resulting data, and provides an effective way to solve the problem.
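To make the crawling problem concrete, the following is a minimal Python sketch of the plain binary range-partition baseline on a single numerical attribute. It is not the improved sorting partition algorithm proposed in the thesis, whose details are not given in this abstract. The names `make_interface`, `crawl`, the limit `k`, and the overflow flag returned by the simulated interface are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Record = Tuple[int, str]  # (numerical attribute value, payload); hypothetical schema
Query = Callable[[int, int], Tuple[List[Record], bool]]


def make_interface(db: List[Record], k: int) -> Query:
    """Simulate a query interface that returns at most k records per range query.

    The second return value reports whether the result was truncated
    (assumed to be observable by the crawler).
    """
    def query(lo: int, hi: int) -> Tuple[List[Record], bool]:
        hits = [r for r in db if lo <= r[0] <= hi]
        return hits[:k], len(hits) > k
    return query


def crawl(query: Query, lo: int, hi: int) -> Tuple[List[Record], int]:
    """Crawl the range [lo, hi]; return (records, number of queries issued)."""
    records, overflow = query(lo, hi)
    if not overflow:
        return records, 1  # the whole range fits in one result page
    if lo == hi:
        # More than k records share one value; this attribute alone cannot
        # split them, so only the returned k records are kept.
        return records, 1
    mid = (lo + hi) // 2  # split the overflowing range in half and recurse
    left, q_left = crawl(query, lo, mid)
    right, q_right = crawl(query, mid + 1, hi)
    return left + right, 1 + q_left + q_right


# Toy hidden database with distinct attribute values and a limit of 5 per query.
db = [(v, f"r{v}") for v in range(0, 100, 3)]
records, n_queries = crawl(make_interface(db, k=5), 0, 99)
```

In this sketch every overflowing query is wasted work, and the total number of queries depends on how the values are distributed over the range, which illustrates why the query cost of binary partitioning is hard to estimate in advance.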
Keywords/Search Tags: Web Data Extraction, Web Hidden Database, Sorting Algorithm, Slice-Cover Algorithm, Entity Matching