Research and Construction of a Tourism Information Data Acquisition Model Based on Hadoop Cloud Computing

Posted on: 2018-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: P F Yang
Full Text: PDF
GTID: 2428330620957785
Subject: Computer Science and Technology
Abstract/Summary:
With its rapid spread and development, the Internet has entered every aspect of daily life and has become an important medium for social contact and information search. At the same time, people share their travel routes, scenic spots, reviews, and impressions online. With the growing number of tourists and the volume of shared travel information, scenic-area and tourism websites are developing rapidly and provide users with increasingly rich tourism information. In the era of big data, which is closely tied to tourism, accurately collecting the complex tourism information scattered across the Internet, and thereby relieving the relative shortage of tourism information data, has become a core issue. Tourism websites contain many types of data, such as structured text, semi-structured Web pages, unstructured text, and videos. The structure of Web pages is heterogeneous, and pages contain noise data irrelevant to tourism. Therefore, how to extract information precisely from Web pages without interference from noise data, store the resulting large-scale data, and retrieve the required tourism information are major problems demanding urgent solutions.

To address these issues, this thesis proposes an integrated scheme: a data acquisition model based on Hadoop and cloud computing, consisting of three parts: collection of tourism information, extraction of information from tourism Web pages, and information retrieval. The collection component is implemented through secondary development of WebCollector, an open-source Web crawler framework; it filters duplicate URLs with a Bloom filter, relieves the DNS-resolution bottleneck with multi-threading, and adopts a breadth-first traversal strategy. The collected tourism data are stored in HBase and HDFS. A MapReduce-parallelized LCS (Longest Common Subsequence) algorithm is designed to eliminate similar Web pages, which effectively reduces duplication in tourism information collection. To extract tourism Web page information precisely, the thesis designs a combination method that fuses label-path features with a blocked DOM tree. The retrieval component is built on Lucene: indexes are built over the HBase data with Lucene, and the collected tourism information is then provided to users through full-text retrieval.
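To make the URL de-duplication step concrete, the sketch below shows how a Bloom filter can screen already-seen URLs before they enter the crawl queue. The thesis builds this into its WebCollector-based crawler; here Guava's BloomFilter is used purely as an illustrative stand-in, and the expected URL count and false-positive rate are assumptions.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

// Sketch of Bloom-filter URL de-duplication for the crawler frontier.
public class UrlSeenFilter {
    // Capacity and false-positive rate are illustrative assumptions.
    private final BloomFilter<String> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.001);

    /** Returns true only the first time a URL is offered; later offers are filtered out. */
    public synchronized boolean markIfNew(String url) {
        if (seen.mightContain(url)) {
            return false;   // very likely crawled already; drop it to avoid a duplicate fetch
        }
        seen.put(url);      // remember the URL for subsequent checks
        return true;
    }
}
```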
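The similar-page elimination rests on an LCS-based similarity measure, which the thesis parallelizes with MapReduce over candidate page pairs. The per-pair kernel might look like the sketch below; the whitespace tokenization and the duplicate threshold are assumptions, not the thesis's exact settings.

```java
// Per-pair near-duplicate test based on the Longest Common Subsequence (LCS).
// In the thesis this kernel runs inside a MapReduce job; the tokenization and
// threshold shown here are illustrative assumptions.
public final class LcsSimilarity {

    /** LCS length of two token sequences via standard dynamic programming. */
    static int lcsLength(String[] a, String[] b) {
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                dp[i][j] = a[i - 1].equals(b[j - 1])
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length][b.length];
    }

    /** Treat two pages as near-duplicates when the LCS covers most of the shorter page. */
    static boolean nearDuplicate(String pageA, String pageB, double threshold) {
        String[] a = pageA.split("\\s+");
        String[] b = pageB.split("\\s+");
        int lcs = lcsLength(a, b);
        return (double) lcs / Math.min(a.length, b.length) >= threshold;
    }
}
```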
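For the extraction component, the thesis combines label-path features with a blocked DOM tree. The sketch below illustrates only the DOM-blocking side of that idea, scoring candidate blocks by how much non-anchor text they carry; it uses jsoup for parsing, and the block selector and scoring weights are assumptions rather than the thesis's actual fusion method.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Simplified DOM-blocking sketch: partition the page into candidate blocks and keep
// the block with the most non-anchor text, since link-heavy blocks are usually noise.
public class BlockExtractor {

    public static String extractMainText(String html) {
        Document doc = Jsoup.parse(html);
        Element best = null;
        double bestScore = 0.0;
        for (Element block : doc.select("div, td, li")) {   // block granularity is an assumption
            int textLen = block.text().length();
            int linkTextLen = 0;
            for (Element a : block.select("a")) {
                linkTextLen += a.text().length();
            }
            // Penalize blocks dominated by anchor text (navigation, ads, related links).
            double score = textLen - 2.0 * linkTextLen;
            if (score > bestScore) {
                bestScore = score;
                best = block;
            }
        }
        return best == null ? "" : best.text();
    }
}
```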
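Finally, the retrieval component builds Lucene indexes over the data stored in HBase. A minimal indexing sketch is shown below; the field layout (an HBase row key plus the extracted page text) is an assumption about the record structure, not the thesis's actual schema.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;

// Minimal sketch of building a Lucene full-text index over extracted tourism records.
public class TourismIndexer implements AutoCloseable {
    private final IndexWriter writer;

    public TourismIndexer(String indexPath) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        writer = new IndexWriter(FSDirectory.open(Paths.get(indexPath)), config);
    }

    public void addRecord(String rowKey, String content) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("rowkey", rowKey, Field.Store.YES)); // link back to the HBase row
        doc.add(new TextField("content", content, Field.Store.NO));  // analyzed for full-text search
        writer.addDocument(doc);
    }

    @Override
    public void close() throws IOException {
        writer.close();
    }
}
```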
Keywords/Search Tags: Hadoop, Cloud Computing, Data Acquisition System, Lucene, Combination Extraction Method