
Research Of Web Information Extraction Based On Features Of Multiple Pages

Posted on: 2018-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: M Liu
Full Text: PDF
GTID: 2348330515468009
Subject: Software engineering
Abstract/Summary:
With the development of the Internet, the Web has become the world's largest repository of information, and big data technologies give us the means to access large collections of data. Information distribution is an important way for people to obtain information in the Web 2.0 era, so extracting data from website pages is highly valuable. HTML, a semi-structured language, is the format most commonly used for Web pages, and the most common way of creating pages is to render templates with data retrieved from databases.

Building on the observation that pages are generated by rendering templates, this thesis proposes an extraction method based on multiple pages: extraction rules are learned by analyzing the similarity of sample pages. The thesis also designs a framework that collects sample clusters correctly from a mass of pages and adapts to updates in page structure, making the extraction process fully automatic.

The thesis studies the structure of Web pages and gives a method for merging sample DOM trees. From the variability of nodes in the merged tree, content nodes can be located and extraction rules derived. The thesis also addresses the failure of extraction rules: by improving the sample-page clustering process, the extraction rules adapt to changes in page structure, so extraction remains fully automatic. The extraction rules and link-generalization results are used to cluster the pages further, which refines the sample grouping and realizes adaptation to structural change.

The thesis also designs an extraction system based on this framework. The system consists of four modules: a sample-gathering module, a rule-extraction module, a page-extraction module, and a central scheduling module. The first three modules can run independently, so they are easy to deploy in a distributed environment; the fourth controls their workflow and the direction of data flow between them. The modules communicate over the network, a design that makes high availability and high throughput possible. In the production environment the system achieves an average daily throughput of 10 million pages, and for news pages in particular both recall and precision reach a high level.
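The two ideas at the core of this approach, clustering structurally similar sample pages and mining the merged DOM tree for variable nodes, can be illustrated with short sketches. The first is a minimal sketch of similarity-based sample clustering, assuming Jaccard similarity over tag n-grams as a stand-in for the thesis's actual similarity measure; signature, jaccard, and cluster are illustrative names, not the thesis's implementation.

```python
# Minimal sketch: group pages that likely share a template, assuming Jaccard
# similarity over tag n-grams approximates the thesis's page-similarity measure.
import re

TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)")

def signature(html, n=3):
    """Set of consecutive tag n-grams approximating the page's template."""
    tags = TAG_RE.findall(html.lower())
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}

def jaccard(a, b):
    """Similarity of two signatures: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster(pages, threshold=0.8):
    """Greedy clustering: a page joins the first cluster it resembles."""
    clusters = []                      # list of (representative signature, members)
    for html in pages:
        sig = signature(html)
        for rep, members in clusters:
            if jaccard(sig, rep) >= threshold:
                members.append(html)
                break
        else:                          # no cluster matched: start a new one
            clusters.append((sig, [html]))
    return [members for _, members in clusters]
```

The second sketch illustrates the merge-and-compare idea under the same caveat: sample trees from one cluster are merged by positionally aligning same-tag children, and nodes whose text varies across samples are reported as candidate content nodes whose paths can serve as extraction rules. Node, TreeBuilder, merge, and variable_paths are hypothetical names, not the thesis's implementation.

```python
# Hedged sketch of merging sample DOM trees and finding variable (content) nodes.
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []
        self.texts = set()             # distinct text values seen across samples

class TreeBuilder(HTMLParser):
    """Parse one sample page into a simple DOM-like tree."""
    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("root")
    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node
    def handle_endtag(self, tag):
        if self.cur.parent is not None:
            self.cur = self.cur.parent
    def handle_data(self, data):
        if data.strip():
            self.cur.texts.add(data.strip())

def merge(a, b):
    """Merge tree b into tree a, aligning children by position and tag."""
    a.texts |= b.texts
    for i, cb in enumerate(b.children):
        match = next((ca for j, ca in enumerate(a.children)
                      if j >= i and ca.tag == cb.tag), None)
        if match is not None:
            merge(match, cb)
        else:                          # structure unique to b: keep it
            a.children.append(cb)

def variable_paths(node, path=""):
    """Yield paths whose text varies across samples (candidate content nodes)."""
    here = f"{path}/{node.tag}"
    if len(node.texts) > 1:
        yield here
    for child in node.children:
        yield from variable_paths(child, here)

samples = [
    "<html><body><h1>Title A</h1><div>Home</div></body></html>",
    "<html><body><h1>Title B</h1><div>Home</div></body></html>",
]
merged = None
for html in samples:
    builder = TreeBuilder()
    builder.feed(html)
    merged = builder.root if merged is None else (merge(merged, builder.root) or merged)
print(list(variable_paths(merged)))    # -> ['/root/html/body/h1']
```

In a real system the alignment would have to tolerate optional and repeated template regions, which is where the adaptive re-clustering described above comes in.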
Keywords/Search Tags: Web extraction, Page cluster, Automatic, DOM merge