Domain-oriented Deep Web Data Automatic Extraction

Posted on: 2013-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: Y Deng
GTID: 2248330377952480
Subject: Computer technology
Abstract/Summary:
With the rapid development of Internet technology, the Web has come to contain vast and varied resources, and it is a valuable intellectual asset for humanity. According to the depth at which data is stored, the Web can be divided into the Surface Web and the Deep Web. Reportedly, 99% of Internet data is Deep Web data, and much of it is open for free use. Faced with such a huge volume of data, how to access and use the information in the Deep Web effectively and efficiently has become a very active research topic in the database field.

This thesis takes a Deep Web automatic data extraction system as its target and addresses the key issues of automatic Deep Web data extraction for a particular domain: entry finding, query submission, detail page location, and result extraction. The contributions are as follows:

Decision tree-based entry finding: For the problem of finding Deep Web entries, an algorithm is proposed that uses a decision tree to generate a valid entry rule, which judges whether a page is an entry for a particular domain. The algorithm can discover potential entry rules and avoids the inherent limitations of common heuristic rules.

Deep Web interaction technique: In Deep Web data extraction, interacting effectively with the interface of a Deep Web database determines whether valid data can be extracted at all. This thesis analyzes existing interaction techniques experimentally and provides a reference for choosing among them.

Neighbor matching algorithm-based search orientation: The problem of locating query pages in the Deep Web is usually overlooked. Most data extraction studies are based on the response page of the Deep Web. The response page provides only a summary, so it contains no detailed information, whereas the detail page is a complete information page that contains the main information on the Deep Web topic. This thesis uses a clustering method, the neighbor distance matching algorithm, to train a model and then locate the query result.

Tree matching-based page extraction: Although Deep Web detail pages share a unified model, their structure and content are complex. Compared with summary pages, detail pages are more challenging to extract. A tree matching-based approach to detail page extraction is therefore proposed, which uses term frequency calculation to deal with noise in the extraction results and makes the extraction results richer.

Experiments were carried out on the model and the algorithms described above. The experimental results show that the proposed method can solve domain-oriented automatic Deep Web data extraction.
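The abstract does not say which tree matching algorithm the thesis uses. As an illustration only, the sketch below shows Simple Tree Matching, a common technique for comparing page structures, assuming a page has already been parsed into a tree of tag names. The names `Node`, `simple_tree_matching`, `tree_size`, and `similarity` are hypothetical and do not come from the thesis, and the term-frequency noise filtering mentioned above is not shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A minimal DOM-like node: a tag name plus ordered children (hypothetical type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of matched node pairs in a maximum top-down matching.

    Assumption: this is the classic Simple Tree Matching recurrence, not
    necessarily the algorithm used in the thesis.
    """
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them can be matched
    m, n = len(a.children), len(b.children)
    # dp[i][j] is the best score using the first i children of a
    # and the first j children of b, preserving sibling order.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 counts the matched roots


def tree_size(node: Node) -> int:
    """Count all nodes in the tree rooted at node."""
    return 1 + sum(tree_size(child) for child in node.children)


def similarity(a: Node, b: Node) -> float:
    """Normalize the match count by the combined tree sizes, giving a value in [0, 1]."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


# Two toy detail pages that share a template but differ in a few nodes.
page_a = Node("div", [Node("h1"), Node("table", [Node("tr"), Node("tr")]), Node("p")])
page_b = Node("div", [Node("h1"), Node("table", [Node("tr")]), Node("span"), Node("p")])
score = similarity(page_a, page_b)
```

A high `score` suggests that two pages were generated from the same template, which is the kind of signal a tree matching-based extractor could use to align data fields across detail pages.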
Keywords/Search Tags: Deep Web, automatic extraction, entry finding, page location, result extraction