The Research Of Extraction Methods Of Websites Data

Posted on:2017-07-15

Degree:Master

Type:Thesis

Country:China

Candidate:Y Hu

Full Text:PDF

GTID:2428330485473689

Subject:Electronic and communication engineering

Abstract/Summary:


The Internet has changed greatly since the emergence of Web 2.0. Anyone who is online can participate and easily publish information, which has led to a large amount of spam. Because of commercial and technical constraints, search engines are not a good solution for retrieving information on a user's interests vertically and efficiently. Moreover, as more and more web pages load their data dynamically through AJAX, search engines cannot analyze such dynamic pages well. How to extract data vertically from dynamic web pages is therefore a question of practical significance.

This thesis studies and experiments with several ways of obtaining information. First, data are obtained through a web API: after studying the API documentation and workflow, the Baidu PM2.5 API is used to retrieve PM2.5 information for Wuhan successfully. Second, RSS is used to obtain news from a website; the target is the general news channel of ifeng.com, and parsing the RSS feed yields a list of news titles and links. Finally, experiments are run on the Baidu search engine with keywords that interest the author and for which the search engine does not return good results in everyday use.

After evaluating these three methods, reviewing the literature on data crawling, and analyzing feasibility, the thesis proposes a semi-automated, wrapper-based system that crawls information vertically through the DOM. The system uses a B/S (browser/server) architecture, and its core module is the PhantomJS package. The experimental targets are jd.com, suning.com, and amazon.cn: the data on jd.com and suning.com are generated dynamically, while the data on amazon.cn can be found in the page source, so together they represent the main ways in which data are generated on the Internet. Given the URL of a web page, the system extracts the data directly and successfully from all three websites. The system therefore achieves its goal. However, the experiments show that because the system must run PhantomJS, which is like opening a browser to parse the page, crawling takes some time and requires a server with good performance.
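
The RSS step described in the abstract (parse a feed and collect the news titles and links as a list) can be illustrated with a minimal sketch. This is not the thesis's own code: the sample feed, the example.com URLs, and the function name extract_title_links are all assumptions made for illustration, and a real run would first download the feed from the target site.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed used only for illustration; a real feed
# would be fetched from the news site's RSS URL.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example news channel</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/news/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/news/2</link>
    </item>
  </channel>
</rss>
"""


def extract_title_links(rss_text):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    pairs = []
    for item in root.iter("item"):
        # findtext returns None when the child element is missing.
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title and link:
            pairs.append((title, link))
    return pairs


if __name__ == "__main__":
    for title, link in extract_title_links(SAMPLE_RSS):
        print(title, link)
```

The sketch relies only on the Python standard library and skips items that lack a title or a link. It covers static feed content only; pages whose data are generated dynamically would still need a browser-like renderer, as the abstract notes.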

Keywords/Search Tags:

wrappers, DOM, dynamic data, phantomJS


Related items

1	Design And Implementation Of Web Front-end Page Monitoring Platform Based On PhantomJS
2	Research Of Multiple Wrappers Information Extraction System Based On Tree Model
3	Design And Implementation Of Component Wrappers In The Heterogeneous Component Composition Model
4	Key Technology Research Of User Influence In Social Network
5	Research And Implementation Of The Web Services Model With ETR Technology
6	Research About The Selective Naive Bayesian Classification Based On Weighted Attributes
7	The Design And Implementation Of Coverage Analysis Module Based On TCT Test Suite
8	Domain-Oriented Web Entity Expansion And Robust Optimization Of The Wrapper
9	Automatically constructing wrappers for effective and efficient Web information extraction
10	Hyperspectral Image Band Selection Based On Local Spatial Information