Design And Implementation Of Distributed Data Examination System Based On Hadoop

Posted on: 2017-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: Z W Hu
GTID: 2308330485460529
Subject: Software engineering
Abstract/Summary:
The rapidly developing internet produces huge amounts of data every day, and storage units have grown from GB and TB to PB, EB, and even ZB and YB, ushering in the era of big data. Data carries value, but not all data has the value we want, so its quality must be assessed before use. Traditional data filtering methods are ill-suited to the big data era because they consume considerable human resources, are not comprehensive enough, and are inefficient. When data is generated continuously, quality assessment must consider not only accuracy, completeness, consistency, validity, and stability but also, in particular, the timeliness of the data. It is therefore extremely important to research and apply highly efficient data quality assurance technology.
During an internship at Baidu, the author participated in the development of a distributed data examination system based on Hadoop and, following software engineering practice, completed the requirements analysis, design, implementation, and testing of its core functions: data health, data compare, and indicator detail export. First, the author gained a general understanding of user requirements by communicating with target users and analyzing the product's business process. After a more detailed analysis of the core functional requirements, the author analyzed the system use cases, which define the scope of the system's functions, and further clarified the functional and non-functional requirements. Based on these requirements, the author then completed the high-level design of the system, including the system architecture design, data interface design, and database design.
Finally, Redis, Python, Java, HDFS, MapReduce, Hadoop Streaming, and other technologies were used to implement a highly efficient distributed data examination system that supports highly concurrent tasks and the processing of large data sets (greater than 1 TB). The Web part of the system is built on the Yii PHP framework, AngularJS, the Baidu ECharts charting library, and related tools. By visualizing the data health indicators (field type, field info, coverage rate, error rate) and the data compare indicators (new, missing, different, same), the system lets users view data quality assessments at a glance. Users can also use the indicator detail export function to trace problems. The system also includes a complete detection mechanism that automatically detects the data encoding, data structure, and field types of data in two-dimensional table format and in JSON/XML tree-structured formats. The system performs the examination based on these detection results, so it can be used with zero configuration. It also supports user-defined examination rules, which greatly expands its scope of application.
In the three months since going online, the system has provided data quality assurance services to 41 departments and 167 users and has completed 1530 data examination tasks, with an average monthly data volume of 169 TB. The system is easy to use: in addition to the Web interface, it provides a complete API so that other projects can integrate with it and find data quality problems promptly. The system also provides an email notification service that prompts the people responsible for the data to follow up on problems, thereby providing data quality assurance for other products.
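The abstract names the data compare indicators (new, missing, different, same) and the data health indicators (coverage rate, error rate) without defining them. The following is a minimal single-machine sketch in Python of how such indicators might be computed. The function names, the keyed-record input layout, and the exact definitions of coverage and error rate are assumptions made for illustration; they are not taken from the thesis, whose implementation runs on MapReduce and Hadoop Streaming.

```python
from collections import Counter


def compare_records(old, new):
    """Count the four compare indicators between two data versions.

    `old` and `new` map a record key to its record (a dict of fields).
    This keyed layout is an assumption made for illustration.
    """
    counts = Counter({"new": 0, "missing": 0, "different": 0, "same": 0})
    for key in old.keys() | new.keys():
        if key not in old:
            counts["new"] += 1        # only in the new version
        elif key not in new:
            counts["missing"] += 1    # dropped from the new version
        elif old[key] != new[key]:
            counts["different"] += 1  # same key, changed content
        else:
            counts["same"] += 1       # identical in both versions
    return dict(counts)


def field_health(records, field, is_valid):
    """Return (coverage_rate, error_rate) for one field.

    Assumed definitions: coverage is the share of records in which the
    field is present and non-empty; error rate is the share of those
    populated values that fail the `is_valid` check.
    """
    total = len(records)
    present = [r[field] for r in records if r.get(field) not in (None, "")]
    errors = sum(1 for value in present if not is_valid(value))
    coverage = len(present) / total if total else 0.0
    error_rate = errors / len(present) if present else 0.0
    return coverage, error_rate
```

In a distributed setting, the same per-record classification could plausibly run in the map phase, with the counts summed in the reduce phase.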
Keywords/Search Tags:Data Quality, Data Examination, Distributed, HDFS, MapReduce, Hadoop Streaming