The Web Hidden Database Extraction And Consistency

Posted on:2017-02-15

Degree:Master

Type:Thesis

Country:China

Candidate:Y Sun

Full Text:PDF

GTID:2348330488496348

Subject:Software engineering

Abstract/Summary:


Existing search engine technology crawls only part of the data on the surface pages of the Internet by following hyperlinks. More and more organizations now allow public users to access a backend database through a query interface on a Web page, and existing search engine technology has difficulty crawling the data in such a backend database effectively. This backend database is often referred to as a Web Hidden Database. The purpose of crawling a hidden database is to analyse, integrate and mine the acquired data and to provide related value-added services. Since most Web query interfaces limit the number of records returned per query, the main problem in crawling a Web Hidden Database is how to retrieve all the records from the backend database at the cost of as few page queries as possible. This problem has become one of the hotspots in the Web mining area.

In this paper, on the basis of existing methods of Web data extraction and the features of the Web Hidden Database, we analyse extraction algorithms for the Web Hidden Database and the sub-block strategy of entity matching. The main algorithms are studied in depth, with analysis and experiments carried out for each. The main research contents and work of this paper are as follows:

1. A Web hidden data extraction model is introduced. Based on this model, we study in depth the extraction problem for numerical attributes, classification attributes and hybrid attributes respectively. The model has good scalability.
2. For the problem of extracting numerical attributes in a Web Hidden Database: first, since the query cost of the traditional binary partition algorithm is difficult to estimate, we propose an improved sorting partition algorithm. Second, we analyse and verify the algorithm from one dimension to multiple dimensions.
3. For the problem of extracting classification attributes: on the basis of a query decomposition tree built on classification attributes, and using depth-first search, we propose a heuristic slice-cover algorithm, which lowers the query cost and improves efficiency.
4. On the basis of the heuristic slice-cover algorithm for classification attributes and the sorting partition algorithm for numerical attributes, a hybrid extraction algorithm is proposed for Web Hidden Databases that have both numerical and classification attributes. This further reduces the query cost of extracting a Web Hidden Database.
5. To improve the consistency of the data, for the problem of duplicate entities in the database, we combine attribute predicates with the matching function to obtain a better matching strategy for duplicate entities.

In this paper, we study how to obtain all the data in a hidden database at the cost of few queries, together with the consistency of the resulting data, and provide an effective way to solve the problem.
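The crawling problem described above, in which a query interface returns only a limited number of records per query, can be illustrated with a small sketch. The code below is not the improved sorting partition algorithm of the thesis; it is a plain recursive binary range split of the kind the abstract calls the traditional binary partition approach, run against a simulated interface. The class `TopKInterface`, the function `crawl`, the limit `k` and the overflow flag are all hypothetical names and assumptions introduced for illustration only.

```python
from typing import List, Tuple


class TopKInterface:
    """Hypothetical stand-in for a Web query interface over one numerical attribute.

    Assumption: each range query returns at most k records, plus a flag that
    says whether more matching records exist than were returned.
    """

    def __init__(self, values: List[int], k: int) -> None:
        self._values = sorted(values)
        self.k = k
        self.queries = 0  # number of queries issued so far (the crawl cost)

    def query(self, lo: int, hi: int) -> Tuple[List[int], bool]:
        """Return up to k values in the closed range [lo, hi] and an overflow flag."""
        self.queries += 1
        matches = [v for v in self._values if lo <= v <= hi]
        return matches[: self.k], len(matches) > self.k


def crawl(db: TopKInterface, lo: int, hi: int) -> List[int]:
    """Retrieve the values in [lo, hi] by splitting any overflowing range in half."""
    results, overflow = db.query(lo, hi)
    if not overflow:
        return results  # the range fit within the limit, so it is complete
    if lo == hi:
        # More than k records share a single value; this attribute alone
        # cannot separate them, so only the returned k are kept.
        return results
    mid = (lo + hi) // 2
    return crawl(db, lo, mid) + crawl(db, mid + 1, hi)


if __name__ == "__main__":
    db = TopKInterface(values=list(range(0, 100, 3)), k=5)
    records = crawl(db, 0, 99)
    print(len(records), "records retrieved with", db.queries, "queries")
```

The counter `queries` makes the cost visible: the number of queries depends on how the values happen to fall relative to the midpoints, which gives a rough sense of why the cost of a binary split is hard to estimate in advance.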

Keywords/Search Tags:

Web Data Extraction, Web Hidden Database, Sorting Algorithm, Slice-Cover Algorithm, Entity Matching


Related items

1	Research And Application Of Named Entity Recognition Method For The Bidding Data
2	PSO Main Ridge Slice Of Fuzzy Search Function Method And Sorting Of The
3	Research On The Techniques Of Entity Identity On XML Data
4	Shou Guang Talent Recruitment Website Cover In The Retrieval Algorithm
5	Research On Radar Signal Sorting Using Multi-dimension Information Features
6	Research On Entity Linking Algorithm By Combining The Attention Mechanism And Hidden Semantic Information
7	Study On Chinese Named Entity Recognition Based On Hidden Markov Model
8	A Study On Land Cover Feature Extraction And Classification Using High Dimensional Remote Sensing Data
9	Research On Program Slice Technology And Slice Scheme Design
10	Research On The Image Matching Algorithm Of LED Chip Check And Sorting