Research On Web Page Content Extraction Based On Hadoop

Posted on:2018-06-17

Degree:Master

Type:Thesis

Country:China

Candidate:J Wang

Full Text:PDF

GTID:2348330536979660

Subject:Computer application technology

Abstract/Summary:

With the rapid development of Internet technology and the growing number of Internet users, the amount of information on the Web keeps increasing, and Web information extraction has become an active research topic. The Web is an important source of information for users, but because Web content changes dynamically, users cannot quickly locate the main text within a large volume of online information. The problem is how to filter the noise in a page quickly and accurately, across a huge collection of Internet resources, and extract the information in the page that is useful to users. This thesis proposes a Hadoop-based method for Web page text extraction as one way to address these issues, and studies how to ensure both efficiency and accuracy when extracting text from a massive number of Web pages. The research has two main parts. In the first part, the thesis analyzes the existing page blocking method based on visual information and improves the original algorithm so that it generates more complete page blocks. In the second part, the thesis makes full use of the style, content, word frequency, and other features of the page blocks to analyze the content blocks according to their degree of importance. Building on this research, the characteristics of typical system architectures are analyzed, and a Hadoop-based Web page text extraction system is designed and implemented. The test results show that the proposed algorithm has good accuracy and high performance, and that the system can handle the extraction of massive numbers of Web pages. The Hadoop-based extraction method provides a new approach to problems involving massive data.
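
The two-part method described in the abstract (segment a page into blocks, then rank the blocks by importance) can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the block fields (text, link_chars, font_size), the scoring formula, and the function names are all assumptions made for illustration, and the map and reduce steps are plain functions that only mirror the shape a Hadoop job would take.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def importance(block, page_freq):
    """Score one block; higher means more likely to be main content.

    The formula is an assumed illustration, not the thesis's measure.
    """
    words = tokenize(block["text"])
    if not words:
        return 0.0
    # Content feature: share of the block's characters that are link text.
    link_density = min(1.0, block["link_chars"] / len(block["text"]))
    # Word-frequency feature: mean page-wide count of the block's words.
    mean_freq = sum(page_freq[w] for w in words) / len(words)
    # Style feature: font size relative to an assumed 16 px body size.
    style = block["font_size"] / 16.0
    return len(words) * (1.0 - link_density) * mean_freq * style


def map_page(url, blocks):
    """Map step: emit (url, (score, text)) for every block of one page."""
    page_freq = Counter(w for b in blocks for w in tokenize(b["text"]))
    for block in blocks:
        yield url, (importance(block, page_freq), block["text"])


def reduce_page(url, scored_blocks):
    """Reduce step: keep the text of the highest-scoring block."""
    best_score, best_text = max(scored_blocks, key=lambda pair: pair[0])
    return url, best_text
```

In this sketch a navigation bar made entirely of links scores zero, so a block of running text wins the reduce step. Because each page is scored independently of every other page, the map step can run in parallel across pages, which is the property a Hadoop deployment would exploit.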

Keywords/Search Tags:

Web information extraction, mass data, page segmentation, importance degree, Hadoop

Related items

1	Design And Implementation Of The Mass Data Analysis System Based On Hadoop
2	Web Page-oriented Handheld Devices Automatically Cutting Technology Research
3	Research On Mining Structure Of WEB Page For Information Extraction
4	Research On Web Article Automatic Extraction Method Based On Page Segmentation
5	The Design And Implementation Of Internet News Reading System Based On Hadoop
6	Information Extraction Technique For Web Page Based On TPSN-LS And Hadoop
7	A Study On Methods Of Web Page Topical Information Extraction
8	Research On Extraction Of Web Data Entities Based On Domain Features
9	Web Page Importance Ranking With Priori Knowledge
10	Research And Implementation Of Web Page Segmentation Algorithm MFPS Based On Multi-Feature