Research and Construction of a Tourism Information Data Acquisition Model Based on Hadoop Cloud Computing

Posted on: 2018-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: P F Yang
Full Text: PDF
GTID: 2428330620957785
Subject: Computer Science and Technology
Abstract/Summary:
With its rapid spread and development, the Internet has entered every aspect of daily life and has become an important medium for social contact and information search. At the same time, people share their travel routes, scenic spots, reviews, and impressions online. With the growing number of tourists and the volume of shared travel information, scenic-area and tourism websites are developing rapidly and provide users with increasingly rich tourism information. In the era of big data, which is closely tied to tourism, accurately collecting the complex tourism information scattered across the Internet, and thereby relieving the relative shortage of tourism information data, has become a core issue. Tourism websites contain many types of data, such as structured text, semi-structured Web pages, unstructured text, and videos. The structure of Web pages is heterogeneous, and pages contain noise data irrelevant to tourism. Therefore, how to extract information precisely from Web pages without interference from noise data, store the resulting large-scale data, and retrieve the required tourism information are major problems demanding urgent solutions.

To address these issues, this thesis proposes an integrated scheme: a data acquisition model based on Hadoop and cloud computing, consisting of three parts: collection of tourism information, extraction of information from tourism Web pages, and information retrieval. The collection component is implemented through secondary development of WebCollector, an open-source Web crawler framework; it filters duplicate URLs with a Bloom filter, relieves the DNS-resolution bottleneck with multi-threading, and adopts a breadth-first traversal strategy. The collected tourism data are stored in HBase and HDFS. A MapReduce-parallelized LCS (Longest Common Subsequence) algorithm is designed to eliminate similar Web pages, which effectively reduces duplication in tourism information collection. To extract tourism Web page information precisely, the thesis designs a combination method that fuses label-path features with a blocked DOM tree. The retrieval component is built on Lucene: indexes are built over the HBase data with Lucene, and the collected tourism information is then provided to users through full-text retrieval.
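To make the URL de-duplication step concrete, the sketch below shows how a Bloom filter can screen already-seen URLs before they enter the crawl queue. The thesis builds this into its WebCollector-based crawler; here Guava's BloomFilter is used purely as an illustrative stand-in, and the expected URL count and false-positive rate are assumptions.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

// Sketch of Bloom-filter URL de-duplication for the crawler frontier.
public class UrlSeenFilter {
    // Capacity and false-positive rate are illustrative assumptions.
    private final BloomFilter<String> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.001);

    /** Returns true only the first time a URL is offered; later offers are filtered out. */
    public synchronized boolean markIfNew(String url) {
        if (seen.mightContain(url)) {
            return false;   // very likely crawled already; drop it to avoid a duplicate fetch
        }
        seen.put(url);      // remember the URL for subsequent checks
        return true;
    }
}
```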
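The similar-page elimination rests on an LCS-based similarity measure, which the thesis parallelizes with MapReduce over candidate page pairs. The per-pair kernel might look like the sketch below; the whitespace tokenization and the duplicate threshold are assumptions, not the thesis's exact settings.

```java
// Per-pair near-duplicate test based on the Longest Common Subsequence (LCS).
// In the thesis this kernel runs inside a MapReduce job; the tokenization and
// threshold shown here are illustrative assumptions.
public final class LcsSimilarity {

    /** LCS length of two token sequences via standard dynamic programming. */
    static int lcsLength(String[] a, String[] b) {
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                dp[i][j] = a[i - 1].equals(b[j - 1])
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length][b.length];
    }

    /** Treat two pages as near-duplicates when the LCS covers most of the shorter page. */
    static boolean nearDuplicate(String pageA, String pageB, double threshold) {
        String[] a = pageA.split("\\s+");
        String[] b = pageB.split("\\s+");
        int lcs = lcsLength(a, b);
        return (double) lcs / Math.min(a.length, b.length) >= threshold;
    }
}
```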
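For the extraction component, the thesis combines label-path features with a blocked DOM tree. The sketch below illustrates only the DOM-blocking side of that idea, scoring candidate blocks by how much non-anchor text they carry; it uses jsoup for parsing, and the block selector and scoring weights are assumptions rather than the thesis's actual fusion method.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Simplified DOM-blocking sketch: partition the page into candidate blocks and keep
// the block with the most non-anchor text, since link-heavy blocks are usually noise.
public class BlockExtractor {

    public static String extractMainText(String html) {
        Document doc = Jsoup.parse(html);
        Element best = null;
        double bestScore = 0.0;
        for (Element block : doc.select("div, td, li")) {   // block granularity is an assumption
            int textLen = block.text().length();
            int linkTextLen = 0;
            for (Element a : block.select("a")) {
                linkTextLen += a.text().length();
            }
            // Penalize blocks dominated by anchor text (navigation, ads, related links).
            double score = textLen - 2.0 * linkTextLen;
            if (score > bestScore) {
                bestScore = score;
                best = block;
            }
        }
        return best == null ? "" : best.text();
    }
}
```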
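Finally, the retrieval component builds Lucene indexes over the data stored in HBase. A minimal indexing sketch is shown below; the field layout (an HBase row key plus the extracted page text) is an assumption about the record structure, not the thesis's actual schema.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;

// Minimal sketch of building a Lucene full-text index over extracted tourism records.
public class TourismIndexer implements AutoCloseable {
    private final IndexWriter writer;

    public TourismIndexer(String indexPath) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        writer = new IndexWriter(FSDirectory.open(Paths.get(indexPath)), config);
    }

    public void addRecord(String rowKey, String content) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("rowkey", rowKey, Field.Store.YES)); // link back to the HBase row
        doc.add(new TextField("content", content, Field.Store.NO));  // analyzed for full-text search
        writer.addDocument(doc);
    }

    @Override
    public void close() throws IOException {
        writer.close();
    }
}
```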
Keywords/Search Tags: Hadoop, Cloud Computing, Data Acquisition System, Lucene, Combination Extraction Method