Domain-oriented Deep Web Data Automatic Extraction

Posted on:2013-05-31

Degree:Master

Type:Thesis

Country:China

Candidate:Y Deng

Full Text:PDF

GTID:2248330377952480

Subject:Computer technology

Abstract/Summary:

With the rapid development of Internet technology, the Web has come to contain vast amounts of rich, all-inclusive resources, and it is a valuable intellectual asset for humanity. According to the depth at which data is stored, the Web can be divided into the Surface Web and the Deep Web. Reportedly, 99% of Internet data is Deep Web data, and much of it is open for free use. Faced with such a huge volume of data, how to access and use the information in the Deep Web effectively and efficiently has become a very active research topic in the database field.

This paper targets an automatic Deep Web data extraction system and addresses the key issues of automatic Deep Web data extraction for a particular domain, such as entry finding, query submission, detail page location, and result extraction. The issues addressed are as follows:

Decision tree-based entry finding: For the problem of finding Deep Web entries, an algorithm is proposed that uses a decision tree to generate valid entry rules, which are then used to judge whether a page is an entry for a particular domain. The algorithm can discover potential entry rules and avoids the inherent limitations of common heuristic rules.

Deep Web interaction technique: In Deep Web data extraction, interacting effectively with the interface of a Deep Web database is the key technique that determines whether valid data can be extracted. This paper analyzes existing interaction techniques experimentally and provides a reference for choosing among them.

Neighbor matching-based query result location: The location of query result pages in the Deep Web is usually overlooked. Most data extraction studies work on the response pages of the Deep Web. A response page provides only a summary, so it carries no detailed information, whereas a detail page is a complete information page that contains the main information on the Deep Web topic. This paper uses a clustering method, the neighbor distance matching algorithm, to train a model and then locate the query results.

Tree matching-based page extraction: Although Deep Web detail pages follow a unified model, their structure and content are complex, and extracting them is more challenging than extracting summary pages. A tree matching-based approach to detail page data extraction is therefore proposed, which uses term frequency calculation to handle noise in the extraction results and makes the results richer.

This paper reports experiments on the model and algorithms described above. The experimental results show that the proposed method can perform domain-oriented automatic Deep Web data extraction.
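
The abstract does not spell out the tree matching algorithm it uses. As a rough illustration of the general idea only, the sketch below implements the classic Simple Tree Matching algorithm, which is swapped in here as an assumption and is not necessarily the method of the thesis. The `Node` type, the `similarity` helper, and the two example page trees are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical DOM-like node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the maximum top-down, order-preserving matching of two trees."""
    if a.tag != b.tag:
        return 0  # differing roots: nothing below them can match
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i children of a
    # and the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 counts the matched roots


def size(node: Node) -> int:
    """Number of nodes in the tree rooted at node."""
    return 1 + sum(size(child) for child in node.children)


def similarity(a: Node, b: Node) -> float:
    """Matching size normalized to [0, 1] (an assumed normalization)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two made-up detail-page skeletons that share most of their structure.
page_a = Node("div", [Node("h1"), Node("table", [Node("tr"), Node("tr")])])
page_b = Node("div", [Node("h1"), Node("p"), Node("table", [Node("tr")])])

print(simple_tree_matching(page_a, page_b))  # 4
print(similarity(page_a, page_b))            # 0.8
```

In a sketch like this, subtrees that match across several detail pages would be treated as shared template, and the unmatched parts as candidate data regions. How the thesis combines this with term frequency to filter noise is not described in the abstract.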

Keywords/Search Tags:

Deep Web, automatic extraction, entry finding, page location, result extraction

Related items

1	The Design And Implement Of Web Page Automatic Categorization And Storage Management System
2	Image-based Form Recognition Algorithm And Automatic Entry System
3	Research On Web Page Classification And Information Collection
4	Software Platform Development Of Accurate Indoor Location Finding System Based On Wireless Local Area Network
5	Research On Web Article Automatic Extraction Method Based On Page Segmentation
6	Research And Implementation Of The Establishment Method Of Science And Technology Entry Database
7	The Method For Extracting Side Page Of 3D Book Model
8	Research On Location Finding In Cellular Networks Based On The Third Generation Mobile Communication System
9	Entry And Exit Management System Based On J2EE
10	Multiple Target Direction Finding Cross Location Algorithm Research Quickly