Web Wrapper Generation And Adaptive Technology Based On Multi-Feature

Posted on: 2022-09-27  Degree: Master  Type: Thesis
Country: China  Candidate: Y N Guo  Full Text: PDF
GTID: 2568307049959789  Subject: Computer technology
Abstract/Summary:
The Internet is a huge resource pool. The number of web pages has reached the hundred-billion level and will continue to grow. With the arrival of the big data era, information of all kinds is growing exponentially. Web pages, as an important carrier for storing data on the Internet, contain a large amount of information. To extract the useful information we need, various Web data extraction technologies have been proposed. Web data extraction differs from ordinary text extraction mainly in that it relies on the structure of the web page itself: it extracts the semi-structured data of a web page according to certain rules, saves it in a specific format, and returns it to the user. The main implementation is the web page wrapper. However, this technology still has problems. Because of the inherent nature of wrappers and the complexity of the extraction tasks they perform, wrappers are usually tightly coupled to the structure of the Web pages they handle, and even minor changes to that structure may cause the extraction task to fail. To reduce maintenance cost, a wrapper, once properly developed, should keep working for a long time. Therefore, this thesis proposes a Web wrapper generation and adaptive technology based on multi-feature. The specific work is as follows:

(1) This thesis presents a web page wrapper generation technology based on multi-feature. Taking the analysis of the page source code of web applications as the starting point, it designs a feature extraction algorithm that extracts the visual features of web pages based on page rendering, together with the attribute and structure features of web page tags. The approach then locates the target data extraction area and the target data items, and finally generates a wrapper configuration file that encapsulates all the information of the feature subset in a specific format, producing the web page wrapper.

(2) This thesis presents a web page wrapper adaptive technology based on similarity calculation. First, the method obtains the feature set of the current web page and the information in the original wrapper, and calculates their structure similarity to determine whether the structure of the web page has changed. If it has, the method relocates the target area and the target data items of the wrapper, and then obtains the matching between the target areas of the old and new versions of the web page and the correspondence between the data items in those areas. Finally, the original web page wrapper is adaptively adjusted according to the resulting associations between the two versions.

To demonstrate the feasibility and effectiveness of the method, it is evaluated on real web applications. The experimental results show that the method can effectively generate web page wrappers with an average accuracy of 95% and a recall rate of 97%, and achieves an adaptive success rate of 87% after a web page is updated.
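
As a rough illustration of the structure-change check described in (2), the following minimal sketch compares two versions of a page by the set of root-to-element tag paths they contain. The abstract does not say which similarity measure the thesis uses, so the Jaccard measure, the function names, and the 0.8 threshold here are all assumptions made for illustration only.

```python
from html.parser import HTMLParser

# HTML void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of tag paths (e.g. 'html/body/div') seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structure_similarity(old_html, new_html):
    """Jaccard similarity of tag-path sets (assumed measure, not the thesis's)."""
    a, b = tag_paths(old_html), tag_paths(new_html)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def structure_changed(old_html, new_html, threshold=0.8):
    """Hypothetical decision rule: flag a change when similarity drops below threshold."""
    return structure_similarity(old_html, new_html) < threshold
```

In a pipeline like the one the abstract outlines, a check of this kind would only decide whether the wrapper needs to be relocated and adjusted; matching the old and new target areas and data items is a separate step that this sketch does not attempt.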
Keywords/Search Tags:Web Data Extraction, Adaptive, Web Wrapper, Similarity Calculation, Web Features