
Design Of Data Visualization Platform Based On CDH

Posted on: 2019-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Hao
GTID: 2428330548482479
Subject: Electronic and communication engineering
Abstract/Summary:
In recent years, as technology has developed, daily life has become inseparable from computers, and large volumes of data are produced as a result. Traditional industries store data in conventional databases. When the data volume grows, the usual remedy is to distribute the database, but beyond a certain scale this approach runs into performance bottlenecks that it cannot resolve. Many large companies have therefore begun to develop new architectures suited to the new data volumes.

Visualization technology built on traditional databases is mature, but visualizing big data together with the results of advanced algorithms such as machine learning is more complicated, and it is very difficult to design the system architecture in the same way. Many new open-source data visualization technologies are now emerging, each with its own advantages and disadvantages. Many companies and individuals do not know how to choose among them, because learning every detail of each new technology costs too much. The aim of this thesis is to use these open-source projects to design a data architecture that is stable, robust, scalable, and manageable, that preserves the information in the original data, and that presents the data in an accessible, visual form.

Big data visualization is still at an exploratory stage, so open-source and commercial projects of all kinds are springing up. At present, big data is stored and managed mainly through Hadoop and the other components of its ecosystem. Choosing the components of the overall architecture, along with other convenient open-source projects, is difficult. The architecture selected in this thesis has been tested, so its effectiveness and practicality are in line with the purpose of the thesis.

Data is stored in HDFS on a Hadoop cluster. With later cluster operation and maintenance in mind, the project uses the CDH distribution of Hadoop published by Cloudera, together with Cloudera's Hadoop management tool, Cloudera Manager (CM). Within Hadoop, the project uses the Hive data warehouse, the Impala data analysis engine, and the HBase column-oriented database. The Hive metadata (the metastore, which stores information about Hive tables) is kept in a MySQL database. All of these components are installed through CM. The data display tool is CBoard, an open-source project on GitHub.

The project is implemented by first designing a data model based on the project requirements. According to that model, the original data is extracted from the source database into the Hive data warehouse using the ETL tool Kettle. The data is then synchronized to Impala, and CBoard connects to Impala to visualize and display it. The environment setup, the data modeling, and the display were all completed as intended.
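
The abstract does not say how the Hive data is synchronized to Impala. Because Impala reads table definitions from the Hive metastore, one common approach is to issue `INVALIDATE METADATA` for tables newly created through Hive and `REFRESH` for existing tables that received new data. The Python sketch below only builds those statements; it is an illustration under that assumption, not the method used in the thesis. The database and table names (`sales_dw`, `fact_orders`, `dim_customer`) and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SyncPlan:
    """Impala statement needed after an ETL job has loaded one Hive table."""
    database: str        # hypothetical Hive/Impala database name
    table: str           # hypothetical table name
    newly_created: bool  # True if the ETL run created the table through Hive

    def impala_statements(self) -> List[str]:
        name = f"{self.database}.{self.table}"
        if self.newly_created:
            # A table created through Hive is not visible to Impala until
            # Impala's cached metadata for it is invalidated.
            return [f"INVALIDATE METADATA {name}"]
        # For a table Impala already knows, REFRESH picks up new data files.
        return [f"REFRESH {name}"]


def plan_sync(database: str, tables: Dict[str, bool]) -> List[str]:
    """Return the Impala statements for all tables, in sorted table order.

    `tables` maps each table name to True if it was newly created in Hive.
    """
    statements: List[str] = []
    for table in sorted(tables):
        plan = SyncPlan(database, table, tables[table])
        statements.extend(plan.impala_statements())
    return statements


if __name__ == "__main__":
    for stmt in plan_sync("sales_dw", {"fact_orders": False, "dim_customer": True}):
        print(stmt)
```

The generated statements would be run against Impala after each Kettle load, for example through `impala-shell` or any client the deployment already uses, so that the tables CBoard queries reflect the latest Hive data.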
Keywords/Search Tags: Hive, Hadoop, Visualization, Unstructured-data