
Approach on Vision-Based Deep Web Data Extraction

Posted on: 2015-11-20
Degree: Master
Type: Thesis
Country: China
Candidate: F Z Tan
Full Text: PDF
GTID: 2298330431964352
Subject: Computer application technology
Abstract/Summary:
In recent years, network technology has become more and more widespread. With its development, the Web has become a huge resource holding massive amounts of valuable data. Many applications, such as market intelligence analysis, urgently need to mine this data to obtain useful information and so support decision-making as far as possible. However, web data is large-scale, heterogeneous, autonomous, and distributed, which makes web data mining and analysis particularly difficult. The data must be integrated before high-quality data mining analysis is possible. According to the "depth" at which information resides, the Web is composed of the Deep Web and the Surface Web. Deep Web data far exceeds Surface Web data in both quantity and quality and is of higher value. How to extract Deep Web data efficiently for effective analysis therefore has important practical significance and broad application prospects.
Information on the various sites of the Internet is independent, so it is hard to complete Deep Web data collection. In this setting, conventional search engines play a negligible role in data mining. Writing extraction rules by hand has a low technical threshold and achieves high accuracy, but given the diversity of information sources and the risk that sites will be revised, the manual approach cannot meet people's needs for access to information. Against this background, automatic web information extraction is a problem in urgent need of a solution.
To address this problem, this thesis presents an in-depth, systematic study of automatic Deep Web information extraction, covering vision-based web page information, machine learning model training, automatic Deep Web information extraction, and data item alignment, and develops an automatic Web information extraction system. The specific research work and results are as follows:
(1) Based on visual features, web pages are segmented to obtain a visual-block tree. The visual attributes needed for data region positioning are then gathered from this tree to form the machine learning training set.
(2) An effective machine learning tool is used for training and is combined with manual rules that remove duplicate and noisy information, so that Deep Web data regions are located accurately.
(3) Effective alignment rules are proposed to improve the accuracy of data item alignment.
(4) Based on the above research, an automatic Deep Web information extraction system is developed. Its implemented features include: 1) transformation of web pages into visual trees; 2) automatic data region positioning; 3) complete extraction of data items; 4) wrapper generation; 5) automatic page flipping.
The results show that the proposed approach can extract data from rich list pages quickly and automatically, with essentially no human intervention.
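As a rough illustration of step (1), the Python sketch below shows one way a visual-block tree could be flattened into feature rows for a training set. The VisualBlock class, the function names, and the four attributes are assumptions made for illustration only. The abstract does not say which visual attributes or which page segmentation algorithm the thesis uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VisualBlock:
    """Hypothetical node of a visual-block tree: a rendered rectangle on the page."""
    x: float
    y: float
    width: float
    height: float
    children: List["VisualBlock"] = field(default_factory=list)

    def area(self) -> float:
        return self.width * self.height


def block_features(block: VisualBlock, page: VisualBlock) -> Dict[str, float]:
    """Compute illustrative visual attributes of one block relative to the page."""
    widths = [child.width for child in block.children]
    count = len(widths)
    # Share of children as wide as the first child (within 1 pixel):
    # repeated data records tend to be rendered with the same width.
    similar = (
        sum(1 for w in widths if abs(w - widths[0]) <= 1.0) / count if count else 0.0
    )
    block_center = block.x + block.width / 2
    page_center = page.x + page.width / 2
    return {
        "area_ratio": block.area() / page.area(),
        "center_offset": abs(block_center - page_center) / page.width,
        "child_count": float(count),
        "child_width_similarity": similar,
    }


def training_rows(page: VisualBlock) -> List[Dict[str, float]]:
    """Walk the whole tree and emit one feature row per block."""
    rows = []
    stack = [page]
    while stack:
        block = stack.pop()
        rows.append(block_features(block, page))
        stack.extend(block.children)
    return rows


# Toy page: a navigation bar plus a centered list holding three records.
records = [VisualBlock(100, 200 + 200 * i, 800, 200) for i in range(3)]
result_list = VisualBlock(100, 200, 800, 600, children=records)
nav_bar = VisualBlock(0, 0, 1000, 100)
page = VisualBlock(0, 0, 1000, 1000, children=[nav_bar, result_list])

list_features = block_features(result_list, page)
rows = training_rows(page)
```

In this toy page the list block is large, centered, and made of equally wide children, which is the kind of signal a classifier could learn to associate with a data region. Labels for each row would still have to come from manual annotation, which the sketch leaves out.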
Keywords/Search Tags: deep web, data extraction, visual features, machine learning