
Research Of Web Information Extraction Based On Features Of Multiple Pages

Posted on: 2018-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: M Liu
Full Text: PDF
GTID: 2348330515468009
Subject: Software engineering
Abstract/Summary:
With the development of the Internet, the Web has become the world's largest repository of information, and big data technologies give us the means to access large collections of data. Information distribution is an important way for people to obtain information in the Web 2.0 era, so extracting data from website pages is highly valuable. HTML, a semi-structured language, is the format most commonly used for Web pages, and the most common way of creating pages is to render templates with data retrieved from databases.

Building on the observation that pages are generated by rendering templates, this thesis proposes an extraction method based on multiple pages: extraction rules are learned by analyzing the similarity of sample pages. The thesis also designs a framework that collects sample clusters correctly from a mass of pages and adapts to updates in page structure, making the extraction process fully automatic.

The thesis studies the structure of Web pages and gives a method for merging sample DOM trees. From the variability of nodes in the merged tree, content nodes can be located and extraction rules derived. The thesis also addresses the failure of extraction rules: by improving the sample-page clustering process, the extraction rules adapt to changes in page structure, so extraction remains fully automatic. The extraction rules and link-generalization results are used to cluster the pages further, which refines the sample grouping and realizes adaptation to structural change.

The thesis also designs an extraction system based on this framework. The system consists of four modules: a sample-gathering module, a rule-extraction module, a page-extraction module, and a central scheduling module. The first three modules can run independently, so they are easy to deploy in a distributed environment; the fourth controls their workflow and the direction of data flow between them. The modules communicate over the network, a design that makes high availability and high throughput possible. In the production environment the system achieves an average daily throughput of 10 million pages, and for news pages in particular both recall and precision reach a high level.
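The two ideas at the core of this approach, clustering structurally similar sample pages and mining the merged DOM tree for variable nodes, can be illustrated with short sketches. The first is a minimal sketch of similarity-based sample clustering, assuming Jaccard similarity over tag n-grams as a stand-in for the thesis's actual similarity measure; signature, jaccard, and cluster are illustrative names, not the thesis's implementation.

```python
# Minimal sketch: group pages that likely share a template, assuming Jaccard
# similarity over tag n-grams approximates the thesis's page-similarity measure.
import re

TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)")

def signature(html, n=3):
    """Set of consecutive tag n-grams approximating the page's template."""
    tags = TAG_RE.findall(html.lower())
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}

def jaccard(a, b):
    """Similarity of two signatures: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster(pages, threshold=0.8):
    """Greedy clustering: a page joins the first cluster it resembles."""
    clusters = []                      # list of (representative signature, members)
    for html in pages:
        sig = signature(html)
        for rep, members in clusters:
            if jaccard(sig, rep) >= threshold:
                members.append(html)
                break
        else:                          # no cluster matched: start a new one
            clusters.append((sig, [html]))
    return [members for _, members in clusters]
```

The second sketch illustrates the merge-and-compare idea under the same caveat: sample trees from one cluster are merged by positionally aligning same-tag children, and nodes whose text varies across samples are reported as candidate content nodes whose paths can serve as extraction rules. Node, TreeBuilder, merge, and variable_paths are hypothetical names, not the thesis's implementation.

```python
# Hedged sketch of merging sample DOM trees and finding variable (content) nodes.
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []
        self.texts = set()             # distinct text values seen across samples

class TreeBuilder(HTMLParser):
    """Parse one sample page into a simple DOM-like tree."""
    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("root")
    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node
    def handle_endtag(self, tag):
        if self.cur.parent is not None:
            self.cur = self.cur.parent
    def handle_data(self, data):
        if data.strip():
            self.cur.texts.add(data.strip())

def merge(a, b):
    """Merge tree b into tree a, aligning children by position and tag."""
    a.texts |= b.texts
    for i, cb in enumerate(b.children):
        match = next((ca for j, ca in enumerate(a.children)
                      if j >= i and ca.tag == cb.tag), None)
        if match is not None:
            merge(match, cb)
        else:                          # structure unique to b: keep it
            a.children.append(cb)

def variable_paths(node, path=""):
    """Yield paths whose text varies across samples (candidate content nodes)."""
    here = f"{path}/{node.tag}"
    if len(node.texts) > 1:
        yield here
    for child in node.children:
        yield from variable_paths(child, here)

samples = [
    "<html><body><h1>Title A</h1><div>Home</div></body></html>",
    "<html><body><h1>Title B</h1><div>Home</div></body></html>",
]
merged = None
for html in samples:
    builder = TreeBuilder()
    builder.feed(html)
    merged = builder.root if merged is None else (merge(merged, builder.root) or merged)
print(list(variable_paths(merged)))    # -> ['/root/html/body/h1']
```

In a real system the alignment would have to tolerate optional and repeated template regions, which is where the adaptive re-clustering described above comes in.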
Keywords/Search Tags: Web extraction, Page cluster, Automatic, DOM merge