
Information Extraction Technique For Web Page Based On TPSN-LS And Hadoop

Posted on: 2017-01-11; Degree: Master; Type: Thesis
Country: China; Candidate: T T Li; Full Text: PDF
GTID: 2348330512987471; Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become the world's largest source of information. Data volume continues to grow exponentially, data forms are increasingly diverse, and the time span between web pages is very large. Users' attention to a piece of information changes over time, so quickly and accurately extracting valuable information from massive data has become a difficult problem. Traditional information extraction techniques are suited to small-scale data in specific domains and do not consider the influence of the “time span factor” on extraction accuracy. This thesis studies Web information extraction from the perspective of time synchronization, combining a distributed computing framework with a distributed storage system. Building on an in-depth study of the Hadoop MapReduce programming model and the distributed file system HDFS, and on a time synchronization mechanism for complex networks, it uses extraction rules that combine DOM tree paths with templates, and applies the TPSN-LS algorithm to reduce the synchronization time deviation of the extracted web data. A modified algorithm for constructing the complex network of a page is proposed, and a Web information extraction system is implemented on the Hadoop platform. The main work is as follows:
(1) Complex network theory is combined with the characteristics of web pages to study the structure of the resulting complex networks. The relevant complex network parameters are improved and redefined to meet the requirements of information extraction, and the implementation process of time-synchronized Web information extraction is presented.
(2) HTML pages are converted into XHTML documents and parsed into a DOM tree complex network, on which the information extraction algorithm is studied. A hierarchical DOM tree model for Web information extraction is proposed that applies the TPSN-LS algorithm. The main research contents of Web information extraction are page preprocessing, construction of the DOM tree complex network, location of the data area, and extraction of the target data.
(3) Hadoop is used to achieve efficient, parallel information extraction. The functional modules of the extraction system, the extraction process, and the physical structure of the HDFS data storage are designed, and the system is then implemented in parallel with MapReduce. Finally, multi-node experiments with different data sizes were conducted on the Hadoop platform. The experimental results show that the system has good accuracy, high efficiency, and scalability.
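The extraction steps named in the abstract (parse XHTML into a DOM tree, locate the data area by a DOM path, extract the target nodes, and aggregate in a map/reduce fashion) can be illustrated with a minimal single-process sketch. It is not the thesis implementation: the sample pages, the `TEMPLATE` paths, and the functions `map_page`, `reduce_records`, and `run` are hypothetical names chosen for illustration, the TPSN-LS time synchronization step is omitted because the abstract does not define it, and plain Python functions stand in for Hadoop map and reduce tasks.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, already well-formed XHTML pages without namespaces.
# In a real deployment they would be read from HDFS after HTML -> XHTML conversion.
PAGES = {
    "page1": "<html><body><div id='nav'>Home</div>"
             "<div id='list'><ul><li>Item A</li><li>Item B</li></ul></div>"
             "</body></html>",
    "page2": "<html><body>"
             "<div id='list'><ul><li>Item B</li><li>Item C</li></ul></div>"
             "</body></html>",
}

# Hypothetical extraction template: a DOM path that locates the data area,
# and a path relative to that area that selects the target nodes.
TEMPLATE = {"data_area": ".//div[@id='list']", "target": "./ul/li"}


def map_page(page_id, xhtml):
    """Map step: parse one page and emit (extracted_text, page_id) pairs."""
    root = ET.fromstring(xhtml)                # build the DOM tree
    area = root.find(TEMPLATE["data_area"])    # locate the data area
    if area is None:
        return                                 # page has no data area
    for node in area.findall(TEMPLATE["target"]):
        text = (node.text or "").strip()
        if text:
            yield text, page_id


def reduce_records(pairs):
    """Reduce step: group page ids by extracted text."""
    grouped = defaultdict(list)
    for text, page_id in pairs:
        grouped[text].append(page_id)
    return {text: sorted(ids) for text, ids in grouped.items()}


def run(pages):
    """Run map over every page, then reduce, in a single process."""
    pairs = []
    for page_id, xhtml in pages.items():
        pairs.extend(map_page(page_id, xhtml))
    return reduce_records(pairs)


if __name__ == "__main__":
    for text, page_ids in sorted(run(PAGES).items()):
        print(text, page_ids)
```

Keeping the map step free of shared state is what would let each page be processed on a different node; the grouping by key in `reduce_records` mirrors the shuffle-and-reduce phase only in outline.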
Keywords/Search Tags:Web Information Extraction, Hadoop, Complex Network, TPSN