Research on Methods for Extracting Website Data

Posted on: 2017-07-15  Degree: Master  Type: Thesis
Country: China  Candidate: Y Hu  Full Text: PDF
GTID: 2428330485473689  Subject: Electronic and communication engineering
Abstract/Summary:
The Internet has changed greatly since the emergence of Web 2.0. Anyone who is online can participate and easily publish information, which has led to a large amount of spam. Because of commercial and technical constraints, general search engines are not a good solution for retrieving information on a user's topics of interest vertically and efficiently. In addition, more and more web pages load their data dynamically through AJAX, and search engines cannot analyze such dynamic pages well. How to extract data from dynamic web pages for vertical applications is therefore a meaningful problem.

This thesis reports research and experiments on several ways of acquiring information. First, data is obtained through a web API: after studying the API documentation and workflow, the Baidu PM2.5 API is used to retrieve PM2.5 information for Wuhan successfully. Second, news is obtained through RSS: the target is the general news channel of ifeng.com, and parsing the RSS feed yields a list of news title links. Finally, experiments are run on the Baidu search engine, using keywords that interest the author and for which the search engine does not return good results in everyday use.

After evaluating these three methods, reviewing the data crawling literature, and analyzing feasibility, the thesis proposes a semi-automated, wrapper-based system that crawls vertical information data through the DOM. The core module of the system is the PhantomJS package, and the system uses a B/S (browser/server) architecture. The experimental targets are jd.com, suning.com, and amazon.cn. The data on jd.com and suning.com is generated dynamically, while the data on amazon.cn can be found in the page source; together they represent the main ways in which data is generated on the Internet. Given the URL of a web page, the system retrieves the data directly and successfully from all three websites, so it achieves its goal. However, the experiments show that because the system must run PhantomJS, which is like opening a browser to parse the page, crawling the data takes some time and requires a server with good performance.
Keywords/Search Tags: wrappers, DOM, dynamic data, PhantomJS
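
The RSS step mentioned in the abstract (parsing a feed to obtain a list of news title links) can be illustrated with a minimal sketch. The abstract does not say which language or library the thesis used, so Python's standard xml.etree.ElementTree module is an assumption here, as are the function name extract_title_links and the sample feed, which uses placeholder example.com URLs rather than the real ifeng.com feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 sample; a real feed would be downloaded from the
# news site. The titles and example.com links are placeholders.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example news channel</title>
    <link>https://example.com/</link>
    <item>
      <title>First headline</title>
      <link>https://example.com/news/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/news/2</link>
    </item>
  </channel>
</rss>
"""


def extract_title_links(rss_xml):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(rss_xml)
    pairs = []
    # Every news entry in RSS 2.0 is an <item> element inside <channel>.
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:  # skip entries that carry no link
            pairs.append((title, link))
    return pairs


if __name__ == "__main__":
    for title, link in extract_title_links(SAMPLE_RSS):
        print(title, link)
```

This approach only works when a site publishes a feed. It does not cover the dynamically generated pages that the wrapper-based system is meant to handle.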