Information Extraction Technique For Web Page Based On TPSN-LS And Hadoop

Posted on:2017-01-11

Degree:Master

Type:Thesis

Country:China

Candidate:T T Li

GTID:2348330512987471

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, the Web has become the world's largest source of information. Data volume continues to grow exponentially, data forms are increasingly diverse, and the time span between web pages is very large. Users' attention to information also changes over time, so extracting valuable information from massive data quickly and accurately has become a difficult problem. Traditional information extraction technology is suited to small-scale data in specific domains and does not consider the influence of the "time span factor" on extraction accuracy. This thesis studies Web information extraction from the perspective of time synchronization, combining a distributed computing framework with a distributed storage system. Building on an in-depth study of the Hadoop MapReduce programming model and the distributed file system HDFS, and combined with a complex-network time synchronization mechanism, it uses extraction rules that combine DOM tree paths with templates, applies the TPSN-LS algorithm to optimize the time synchronization deviation of the extracted web data, proposes a modified algorithm for constructing the web page complex network, and finally implements a Web information extraction system on the Hadoop platform. The main work is as follows:

(1) Drawing on complex network theory and the characteristics of web pages, the structure of complex networks is studied; the related complex network parameters are improved and redefined to meet information extraction requirements; and the implementation process of time-synchronized Web information extraction is presented.

(2) HTML pages are converted into XHTML documents and parsed into a DOM tree complex network, on which the information extraction algorithm is studied. A hierarchical DOM tree model for Web information extraction is proposed, to which the TPSN-LS algorithm is applied. The main research content covers page preprocessing, construction of the DOM tree complex network, location of the data region, and extraction of the target data.

(3) Hadoop is used to achieve efficient, parallel information extraction. The functional modules of the extraction system, the extraction process, and the physical structure of HDFS data storage are designed, and the extraction system is then implemented in parallel with MapReduce. Finally, multi-node experiments with different data sizes were carried out on the Hadoop platform; the results show that the system has good accuracy, high efficiency, and scalability.
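
As a rough illustration of the DOM-tree-path and template idea mentioned in the abstract, the sketch below parses a small well-formed XHTML fragment into a tree and keeps the text of nodes whose root-to-node path matches a template path. It is not the thesis's algorithm: the sample page, the function names `dom_paths` and `extract`, and the exact-match path rule are all assumptions made for this example, and the TPSN-LS time synchronization step is not modeled.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment; the thesis does not give a sample page.
XHTML = """<html><body>
<div id="nav"><a href="/">Home</a></div>
<div id="content">
  <h1>Sample title</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
</body></html>"""


def dom_paths(root):
    """Yield (path, text) for every element that has non-empty direct text."""
    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        text = (node.text or "").strip()
        if text:
            yield path, text
        for child in node:
            yield from walk(child, path)
    yield from walk(root, "")


def extract(root, template_path):
    """Return the text of all nodes whose DOM path equals template_path.

    Exact string matching is an assumed, simplified stand-in for a
    template-based extraction rule.
    """
    return [text for path, text in dom_paths(root) if path == template_path]


root = ET.fromstring(XHTML)
titles = extract(root, "/html/body/div/h1")
paragraphs = extract(root, "/html/body/div/p")
```

Here `titles` collects the heading text and `paragraphs` the two body paragraphs, while the navigation link is ignored because its path (`/html/body/div/a`) matches neither template.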

Keywords/Search Tags:

Web Information Extraction, Hadoop, Complex Network, TPSN

Related items

1	Time Synchronization Algorithm For Spider System Based On Hadoop
2	Research And Implementation On Automatic Construction Of Complex Network Based On The Technology Of Information Extraction
3	An Improved TPSN Algorithm With Conflict Detection For Time Synchronization In Wireless Sensor Network
4	Research On Web Page Content Extraction Based On Hadoop
5	The Analysis And Research Of Improved Time Synchronization Algorithm TPSN For Wireless Sensor Network
6	Research On Information Retrieval And Public Opinion Detection Algorithm Based On Hadoop
7	Research On Extraction Of The Backbone In Information Recommender Network
8	The Applied Research Of Complex Networks In Processing Of Web News Information
9	The Method Of Extracting Complex Indicators From Long Text
10	Based On The Hadoop Web Information Extraction And The Research And Implementation Of Spam Filtering