Query And Analysis Of News Events Data Based On Hadoop

Posted on: 2019-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: M H Han
Full Text: PDF
GTID: 2348330542498166
Subject: Computer Science and Technology
Abstract/Summary:
With the development of Internet technology, the volume of news data has exploded. GDELT is one such large and rapidly growing global news event dataset, holding more than 480 million records so far. Traditional MySQL/Oracle systems are mostly deployed in stand-alone environments, and their performance on massive datasets is limited: storage does not scale well, analysis and computation are inefficient, real-time query response cannot be guaranteed, and user interaction ultimately suffers. This thesis addresses how to retrieve data rapidly and discover new value in large-scale datasets.

The thesis analyzes open-source distributed storage and search technologies. Building on the Hadoop platform and weighing indexing efficiency, query speed, and computational efficiency, it selects Solr as the distributed search engine and Spark as the core computing engine, and proposes an efficient and practical solution. The interaction between Spark and Solr lets both exploit their parallelism. Using Spark's in-memory computation model, the original data is filtered and aggregated to generate aggregated statistics tables, which are then written back to Solr. This reduces the dimensionality of the original large table and simplifies the computation needed for subsequent queries. Different query requests are served directly from the corresponding statistics tables, and this module is the key to the system's fast response.

Based on these key technologies, an online news service system consisting of an ETL module, a statistics module, and a query analysis module is designed and implemented. The ETL module batch-indexes raw data from HDFS into the Solr search engine. The statistics module pre-aggregates the data in Solr to generate the statistics tables. The query analysis module visualizes the results of news data queries and analysis. The research and experimental results show that the news service system, which integrates data collection, storage optimization, statistical computation, and visualization of query results, achieves efficient storage and quick retrieval of massive data, and verifies the validity and practicability of the proposed method.
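
The pre-aggregation idea described in the abstract can be illustrated with a minimal, single-machine sketch. This is not the thesis's implementation, which uses Spark and Solr; plain Python stands in for both here. The record layout (`date`, `country`, `mentions`), the sample rows, and the function names `build_statistics_table` and `query_country` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical raw event rows. The field names and values are illustrative
# assumptions, not the actual GDELT schema or the thesis's Solr documents.
RAW_EVENTS = [
    {"date": "2018-01-01", "country": "CHN", "mentions": 5},
    {"date": "2018-01-01", "country": "CHN", "mentions": 3},
    {"date": "2018-01-01", "country": "USA", "mentions": 7},
    {"date": "2018-01-02", "country": "CHN", "mentions": 2},
    {"date": "2018-01-02", "country": None, "mentions": 4},  # dropped by the filter
]


def build_statistics_table(events):
    """Filter raw events and aggregate them by (date, country).

    Stands in for the statistics module: one pass over the large raw table
    produces a much smaller table keyed by the query dimensions.
    """
    stats = defaultdict(lambda: {"event_count": 0, "total_mentions": 0})
    for event in events:
        if event["country"] is None:  # filter step: skip incomplete rows
            continue
        key = (event["date"], event["country"])
        stats[key]["event_count"] += 1
        stats[key]["total_mentions"] += event["mentions"]
    return dict(stats)


def query_country(stats, country):
    """Answer a per-country query from the statistics table only."""
    return {date: row for (date, c), row in stats.items() if c == country}


# Built once ahead of time; later queries never touch RAW_EVENTS again.
STATS = build_statistics_table(RAW_EVENTS)
```

In this sketch a query such as `query_country(STATS, "CHN")` scans only the small aggregated table, which mirrors why serving requests from pre-computed statistics tables reduces the per-query computation.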
Keywords/Search Tags: big data, news events analysis, distributed search engine, Spark