
Design And Implementation Of Distributed System On Data Collection And Analysis

Posted on: 2019-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: W Zhang
Full Text: PDF
GTID: 2428330572456447
Subject: Engineering
Abstract/Summary:
With the arrival of the Internet Plus era, the volume of network data has exploded, yet a growing amount of valuable information cannot be obtained by traditional search engines in a timely manner, such as the number of e-commerce product orders, customer comments on products, OTA hotel room information, and reply threads on microblogs. Although such data lies beyond the reach of traditional search engines, it is of great significance and value for the investment decisions of modern enterprises and for social-science research at scientific institutions. Since traditional search engines can no longer satisfy the demands of enterprises, research institutions, and even individual investors for comprehensive, timely, and personalized network data, efficiently acquiring Internet hotspot information and analyzing differentiated, refined data has become an urgent need.

To address these problems, this thesis designed and implemented a distributed system for data collection and analysis. On top of virtualization technology supporting the underlying virtual servers, the system builds a big-data processing platform that uses Storm and Hadoop as the processing framework for data collection and analysis. On the real-time distributed processing platform Storm, a modular data collection functional unit was designed and implemented, comprising a URL building module, an anti-crawler strategy scheduling module, a data labeling and parsing module, and a data formatting module; this unit serves as the front-end for Internet data collection and processing. The NoSQL databases HBase and Redis act as database middleware connecting the back-end data analysis and processing platform to the front-end. Through this middleware interface, the Hadoop back-end obtains the collected and pre-processed data from the front-end and performs Chinese word segmentation on it. The segmented data is then passed to a text correlation analysis module, which carries out statistical analysis using a Bayesian decision analysis algorithm. Finally, the ELK data visualization platform displays the statistical and analytical results as web charts.

By combining Storm real-time stream processing, database middleware, Hadoop batch processing, and ELK data visualization, the system framework provides both real-time and batch data collection and analysis, improves the compatibility, fault tolerance, and scalability of the distributed system, and strengthens its adaptability to differentiated demands. The project studied the design and implementation of a system architecture for the generation, collection, and analysis of network information data, and tested the functions and performance of the distributed data collection and analysis system in a dedicated test environment, confirming that the system design is reasonable and feasible and that every function and performance indicator meets the design requirements.
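As an illustration of the kind of processing the back-end text correlation module performs (the abstract itself contains no code), the following is a minimal sketch in Python of a multinomial naive Bayes classifier with add-one smoothing applied to pre-segmented Chinese tokens. All names, labels, and sample data here are hypothetical and chosen only for illustration; the thesis' actual Bayesian decision analysis module may differ.

import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Minimal multinomial naive Bayes over pre-segmented (word-tokenized) documents."""

    def __init__(self):
        self.class_doc_counts = Counter()                # documents seen per class
        self.class_token_counts = defaultdict(Counter)   # token frequencies per class
        self.vocabulary = set()
        self.total_docs = 0

    def train(self, tokens, label):
        """Record one segmented document (a list of word tokens) with its class label."""
        self.class_doc_counts[label] += 1
        self.total_docs += 1
        for token in tokens:
            self.class_token_counts[label][token] += 1
            self.vocabulary.add(token)

    def classify(self, tokens):
        """Return the class with the highest log-posterior for a segmented document."""
        best_label, best_score = None, float("-inf")
        vocab_size = len(self.vocabulary)
        for label in self.class_doc_counts:
            # Log prior: fraction of training documents belonging to this class.
            score = math.log(self.class_doc_counts[label] / self.total_docs)
            token_total = sum(self.class_token_counts[label].values())
            for token in tokens:
                # Add-one (Laplace) smoothing keeps unseen tokens from zeroing the posterior.
                count = self.class_token_counts[label][token]
                score += math.log((count + 1) / (token_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

if __name__ == "__main__":
    clf = NaiveBayesTextClassifier()
    # Toy training data: comments already split into word tokens
    # (in the thesis pipeline, segmentation is performed by the Hadoop back-end).
    clf.train(["酒店", "房间", "干净", "服务", "好"], "relevant")
    clf.train(["订单", "数量", "增长", "快"], "relevant")
    clf.train(["今天", "天气", "不错"], "irrelevant")
    print(clf.classify(["房间", "服务", "不错"]))   # prints "relevant"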
Keywords/Search Tags: Data crawling, URL building module, anti-crawler strategy, Chinese word segmentation, Bayesian decision algorithm