
Research On Web Page Content Extraction Based On Hadoop

Posted on: 2018-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
Full Text: PDF
GTID: 2348330536979660
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology and the growing number of Internet users, the amount of information on the Web keeps increasing, and Web information extraction has become an active research topic. The Web is an important source of information for its users, but because Web content changes dynamically, users cannot quickly locate the main text within a large volume of online information. The problem is how to filter the noise out of pages drawn from a huge pool of Internet resources, quickly and accurately, and extract the information that is useful to users. This thesis proposes a Hadoop-based method of Web page text extraction as one way to address these problems, and studies how to preserve both the efficiency and the accuracy of extraction when facing a massive number of Web pages. The research consists of two parts. In the first part, the thesis analyzes the existing block segmentation method based on visual information and improves the original algorithm to generate more complete page blocks. In the second part, it makes full use of the style, content, word frequency, and other features of the page blocks to analyze the content blocks according to their degree of importance. On this basis, the structural characteristics of typical systems are analyzed, and a Hadoop-based Web page text extraction system is designed and implemented. The test results show that the proposed algorithm has good accuracy and high performance, and that the system can handle the extraction of massive numbers of Web pages. The Hadoop-based extraction method offers a new approach to massive-data problems.
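The abstract does not give the scoring formula, so the following is only a minimal sketch of how style, content, and word-frequency cues could be combined into a block importance score and arranged as map and reduce steps. Everything in it is an assumption made for illustration: the Block fields, the cue definitions, the weights, and the function names map_page and reduce_page are hypothetical and are not taken from the thesis or from the Hadoop API. The map and reduce functions here are plain Python that only mimic the shape of a MapReduce job.

```python
from collections import Counter
from dataclasses import dataclass
import re


@dataclass
class Block:
    """A page block as a segmentation step might emit it (hypothetical shape)."""
    text: str         # visible text of the block
    link_chars: int   # number of characters that sit inside hyperlinks
    font_size: float  # dominant font size, standing in for "style"


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def importance(block, top_terms, base_font=12.0):
    """Combine content, style, and word-frequency cues (illustrative weights)."""
    words = tokenize(block.text)
    if not words:
        return 0.0
    # Content cue: share of the text that is not link text.
    non_link = 1.0 - min(block.link_chars, len(block.text)) / len(block.text)
    # Style cue: font size relative to an assumed body size, capped at 2.0.
    style = min(block.font_size / base_font, 2.0)
    # Word-frequency cue: fraction of words among the page's most frequent terms.
    freq = sum(1 for w in words if w in top_terms) / len(words)
    return len(words) * non_link * style * (0.5 + freq)


def score_page(blocks, top_k=10):
    """Score every block of one page against that page's frequent terms."""
    page_freq = Counter(w for b in blocks for w in tokenize(b.text))
    top_terms = {w for w, _ in page_freq.most_common(top_k)}
    return [(importance(b, top_terms), b) for b in blocks]


def map_page(url, blocks):
    """Map step: emit (url, (score, text)) for each block of a page."""
    for score, block in score_page(blocks):
        yield url, (score, block.text)


def reduce_page(url, scored):
    """Reduce step: keep the highest-scoring block as the page's main text."""
    _, best_text = max(scored, key=lambda pair: pair[0])
    return url, best_text
```

In this sketch each page is scored independently, which is what would let a framework such as Hadoop distribute pages across nodes; a real job would read pages from HDFS and use the framework's own mapper and reducer interfaces in place of these local functions.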
Keywords/Search Tags: Web information extraction, massive data, page segmentation, importance degree, Hadoop