
Design And Implementation Of High Concurrent Heterogeneous Data Preprocessing System

Posted on: 2018-07-12
Degree: Master
Type: Thesis
Country: China
Candidate: J Li
Full Text: PDF
GTID: 2348330512993106
Subject: Software engineering
Abstract/Summary:
We live in an era of data. As big data technology matures, more and more industries want to use it to re-mine the value of the data they have accumulated, so that the data can play a larger role and better serve users and enterprises. Much of this accumulated data, however, is dirty: incomplete and inconsistent. It either cannot be mined directly or yields unsatisfactory results, so preprocessing the data before any other operation is essential.

The author took part in developing a search and analysis platform for patent data and was responsible for designing and implementing the platform's underlying heterogeneous data preprocessing system. This thesis describes the system in terms of the project's background and significance, the state of development in China and abroad, requirements analysis, technical architecture, functional structure, detailed database design, and the detailed design and testing of the system.

The system provides patent data preprocessing and storage services to the platform. Patent data consists of a huge number of small, scattered files in diverse languages and formats from inconsistent sources, and it must be loaded into the databases within a short time. The thesis therefore introduces the concept of index data, which encapsulates the patent data, and, building on the Quartz framework, designs and implements multi-task parallel loading of patent data into the databases. Five different stores meet the storage requirements: Hybase, a search database, stores the data that must be retrievable; MongoDB, a NoSQL database, stores semi-structured data for front-end display; a distributed file system stores massive unstructured data; Redis, a cache database, stores business data that needs caching; and MySQL, a relational database, stores the control, operation, and maintenance data of the data processing flow. All five are deployed in a distributed manner, and each uses its own mechanism to ensure high availability, such as the master-slave model, hot-standby redundancy, or ZooKeeper.

The system has five modules: data loading and updating, data quality control, data recovery, data monitoring, and a scheduling task tool. Data loading and updating is the most important. Each index data file serves as one batch for loading into the databases; because the index data file encapsulates the patent data files, batches can be processed in a multi-task, parallel way; and loading is divided into multiple stages so that the operations team can verify and roll back the data at each stage. The data quality control and data monitoring modules let the operations team find erroneous data promptly, the data recovery module lets them repair it, and the scheduling task tool module automatically copies index data files to the workspace.

The system was delivered and put online on schedule, and it has loaded the accumulated patent data into the databases for customers to use. It is currently running well. The company is actively promoting the product to improve its competitiveness, and the author believes that more customers will use it in the future.
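The batch-oriented, staged loading described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the thesis builds on Quartz (a Java scheduler), whereas this sketch swaps in Python's standard `concurrent.futures` thread pool to stay self-contained. The names `IndexBatch`, `STAGES`, `run_stage`, `rollback`, `load_batch`, and `load_all`, the three stage names, and the `.bad` suffix used to simulate a corrupt file are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

# Hypothetical stage names; the abstract only says loading has "multiple stages".
STAGES = ("parse", "validate", "store")


@dataclass
class IndexBatch:
    """One index data file: a batch that wraps many small patent files."""
    batch_id: str
    patent_files: list
    completed: list = field(default_factory=list)  # stages finished so far
    status: str = "pending"


def run_stage(stage, batch):
    """Stand-in for real work: the validate stage rejects files ending in '.bad'."""
    if stage == "validate" and any(f.endswith(".bad") for f in batch.patent_files):
        raise ValueError(f"{batch.batch_id}: invalid file found in stage {stage}")


def rollback(batch):
    """Undo completed stages in reverse order (here this only clears the record)."""
    while batch.completed:
        batch.completed.pop()
    batch.status = "rolled_back"


def load_batch(batch):
    """Run every stage for one batch; roll the batch back on the first failure."""
    for stage in STAGES:
        try:
            run_stage(stage, batch)
        except ValueError:
            rollback(batch)
            return batch
        batch.completed.append(stage)
    batch.status = "loaded"
    return batch


def load_all(batches, workers=4):
    """Process batches in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_batch, batches))
```

Calling `load_all` on a list of `IndexBatch` objects returns the same batches with `status` set to `"loaded"` or `"rolled_back"`. Because each batch records its own completed stages, a failure in one batch does not affect the others, which is the property that makes per-batch verification and rollback practical.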
Keywords/Search Tags:Big Data, Heterogeneous Data, High Concurrent, Data Preprocessing