Research On Web Data Extraction Based On Web Page Structure

Posted on: 2017-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: R Hu
GTID: 2308330485464130
Subject: Computer software and theory
Abstract/Summary:
Since the first electronic computers appeared in the mid-twentieth century, and especially with the rapid development of Internet technology in recent decades, the volume of data on the Internet has grown exponentially in every field. Our daily lives have long been closely tied to this data, and the demand for information has reached an unprecedented level. Most data on the Internet is Web data, but web pages are inherently semi-structured and are cluttered with advertisements and other irrelevant noise, which makes it hard to acquire and use the information of interest. How to extract target data from the mass of Internet data accurately and conveniently, and store it in structured form, is therefore becoming increasingly important, and the problem has become a hot research topic among scholars in China and abroad. Most existing work is based on the page's DOM tree or a visual tree and performs manual, semi-automatic, or fully automatic extraction through methods such as wrapper design; many of these studies have achieved good results.
Our study is also built on the DOM tree structure and considers only list pages (List_Page). The extraction process is divided into two steps: locating the target data area, and locating and identifying the data records within it. To locate the target area, we first optimize the DOM tree obtained by parsing the HTML and propose an XPath-based leaf node path optimization algorithm. The output of this algorithm is the set of leaf node paths of the DOM tree, and this path structure is the key to the follow-up work. On this basis we introduce the intermediary mathematical theory (MMTD) and, taking the structural features of the DOM tree into account, propose "Data region Location by MMTD" (DL_MMTD). This mathematical method for quantifying fuzzy phenomena has been used in many areas of computer science, especially fuzzy set processing, but this is the first time it has been applied to web information extraction, and it achieved good results. We then study the extraction of data records. For this we introduce the concept of data record length and propose the "Count Data Record Length by Path Structure" algorithm (CDL_PathStructure). Once the length of each data record has been obtained, we extract the data units sequentially and assemble each data record according to that length.
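The idea of leaf node paths and data record length described above can be illustrated with a small Python sketch. It is not the thesis's DL_MMTD or CDL_PathStructure algorithm, whose details the abstract does not give. It assumes the target data area has already been located, that every record has the same number of text leaves, and that the record length is the smallest period of the leaf path sequence. The names LeafPathCollector, leaf_paths, record_length, group_records, and SAMPLE_HTML are hypothetical and chosen only for this illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

# Hypothetical list page fragment: three records, two data units each.
SAMPLE_HTML = """
<ul>
  <li><a>Book A</a><span>10</span></li>
  <li><a>Book B</a><span>12</span></li>
  <li><a>Book C</a><span>8</span></li>
</ul>
"""


class LeafPathCollector(HTMLParser):
    """Record the tag path (root to parent element) of each non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []       # tags that are currently open
        self.leaf_paths = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaf_paths.append(("/" + "/".join(self.stack), text))


def leaf_paths(html):
    """Return the (path, text) pair of every text leaf in the HTML."""
    collector = LeafPathCollector()
    collector.feed(html)
    collector.close()
    return collector.leaf_paths


def record_length(paths):
    """Assumed heuristic: the smallest period at which the path sequence repeats."""
    n = len(paths)
    for p in range(1, n + 1):
        if n % p == 0 and all(paths[i] == paths[i % p] for i in range(n)):
            return p
    return n


def group_records(leaves):
    """Split the leaf texts into records of the estimated record length."""
    length = record_length([path for path, _ in leaves])
    texts = [text for _, text in leaves]
    return [texts[i:i + length] for i in range(0, len(texts), length)]
```

Calling group_records(leaf_paths(SAMPLE_HTML)) groups the six text leaves into three records of two data units each. Real list pages often have optional fields, so a fixed period is only a starting point; the thesis's own method derives the length from the path structure.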
Keywords: DOM-Tree, MMTD, Target data area, List-Page