Research On Web Data Extraction Based On Web Page Structure

Posted on:2017-04-13

Degree:Master

Type:Thesis

Country:China

Candidate:R Hu

Full Text:PDF

GTID:2308330485464130

Subject:Computer software and theory

Abstract/Summary:

Since the world's first computer appeared in the 1950s, and especially with the rapid development of Internet technology in recent decades, the volume of data on the Internet has grown exponentially across every field. Our daily lives have long been closely tied to this vast amount of data, and the human demand for information has reached an unprecedented level. Web data is the main form of data on the Internet, but web pages are inherently semi-structured and are cluttered with irrelevant advertising and other noise, which makes it hard to acquire and use the information of interest from such a large volume of web data. How to extract target data from the Internet accurately and conveniently, and store it in structured form, is therefore becoming increasingly important, and it has become an active research topic among scholars in China and abroad. Most of this research is based on the page's DOM tree or a visual tree and builds manual, semi-automatic, or fully automatic extraction techniques on top of it, for example by designing wrappers; many of these studies have achieved good results.
Our study is also built on the DOM tree structure and is restricted to list pages (List_Page). The extraction process is divided into two steps: locating the target data region, and locating and identifying the data records within it. To locate the target region, we first optimize the DOM tree obtained by parsing the HTML, and then put forward an XPath-based leaf node path optimization algorithm. The output of this algorithm is a leaf node path of the DOM tree, and this path structure is the key to the follow-up work. On this basis we introduce intermediary mathematical theory (MMTD) and, targeting the structural features of the DOM tree, propose "Data region Location by MMTD" (DL_MMTD). This mathematical method for quantifying fuzzy phenomena has been used in many areas of computer science, especially in fuzzy set processing, but this is the first time intermediary mathematical theory has been applied to web information extraction, and it achieved good results. We then study the data record extraction algorithm. For this we introduce the concept of data record length and accordingly propose the "Count Data Record Length by Path Structure" algorithm (CDL_PathStructure). Once the data record length has been obtained, we extract the data units sequentially and assemble them into data records according to that length.
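
As a rough illustration of the two ideas named above (leaf node paths and assembling records by record length), here is a minimal Python sketch. It is not the thesis's algorithm: the class `LeafPathCollector`, the functions `leaf_paths` and `group_records`, and the sample page are hypothetical names chosen for this example, the record length is passed in by hand rather than computed by CDL_PathStructure, and the MMTD-based region location step is not shown.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LeafPathCollector(HTMLParser):
    """Collects (path, text) pairs for every non-empty text leaf."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open tags from the root down to the current node
        self.leaves = []   # (XPath-like tag path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates some unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/" + "/".join(self.stack), text))


def leaf_paths(html):
    """Return the (path, text) pairs of all text leaves in the page."""
    collector = LeafPathCollector()
    collector.feed(html)
    collector.close()
    return collector.leaves


def group_records(units, record_length):
    """Split a flat list of data units into records of a fixed length.

    Assumption: every record has the same number of data units;
    trailing units that do not fill a whole record are dropped.
    """
    return [units[i:i + record_length]
            for i in range(0, len(units) - record_length + 1, record_length)]


# Hypothetical list page with two records of two data units each.
page = """
<html><body><ul>
<li><a>Book A</a><span>10</span></li>
<li><a>Book B</a><span>12</span></li>
</ul></body></html>
"""
leaves = leaf_paths(page)
units = [text for _path, text in leaves]
records = group_records(units, record_length=2)
```

In this sample the leaf paths repeat with a period of two (`/html/body/ul/li/a`, then `/html/body/ul/li/span`), which is the kind of regularity in the path structure that a record-length computation could exploit.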

Keywords/Search Tags:

DOM-Tree, MMTD, Target data area, List-Page

Related items

1	Research Of Data Extraction Technology Based On Tag Tree From List Pages
2	Research On Data Extraction Of Deep Web Based On Visual Information And Tree Match
3	The Research And Implementation Of One Kind Of Web Page Filtering Method Based On Real-Time Network Traffic Data
4	Study On Automatic Extraction Of Web Data Based On DOM
5	Research On Big Data Quality Evaluation Based On MMTD
6	Research And Implementation Of Chinese Web-page Classification Based On Web Data-mining
7	Structure Information Extraction- Study And Implementation On Semi-auto Wrapper
8	Research On Mining Structure Of WEB Page For Information Extraction
9	The Research On Tracking Area List Management In LTE System
10	Research Of High-Dimension Data Stream Clustering Algorithm Based On Damped Window And Pruning List Tree