With the rapid development of the Internet and distributed systems, we have entered the era of "big data". Big data is characterized by large volume and high propagation velocity, and its value falls sharply as time passes. These characteristics pose great challenges for big data processing. Batch processing based on Hadoop MapReduce can handle large amounts of data. However, its processing interval is usually hourly, so it cannot process data in real time and no longer meets the requirements of real-time data query.

To address big data processing scenarios with strict real-time requirements, this paper designs and implements a distributed real-time data processing system based on Spark and HBase. The system achieves real-time data transformation and query conversion and improves usability. The main contributions of this paper are:

1. Optimize the HDFS file storage policy: consider the workload of the DataNodes when a file is distributed, and reduce hot spots and unnecessary file movement, so as to increase the parallelism of data computation and improve real-time performance.
2. Implement a general, configurable real-time data conversion program: the source data format, the source field conversion rules, and the filtering rules are set through configuration to define the logic of a task, which avoids repeated development.
3. Provide a secondary index for HBase: build the index using MapReduce, and use an HBase coprocessor to intercept CRUD operations on the HTable in order to ensure the correctness of the data in the secondary index.
4. Add an SQL query interface for HBase: parse the SQL statement, implement the schema conversion between relational tables and HTables, and convert the logic of the SQL statement into HBase operations.
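
The idea behind the secondary-index contribution, intercepting every write so that the index never drifts from the data table, can be illustrated with a small in-memory sketch. This is not the paper's implementation and does not use the HBase coprocessor API: the class `IndexedTable` and its methods `put`, `delete`, and `lookup` are hypothetical names, and plain Python dictionaries stand in for the HTable and the index table.

```python
from collections import defaultdict


class IndexedTable:
    """Toy in-memory stand-in for a data table plus one secondary index.

    Hypothetical model only: real HBase would use separate tables and
    coprocessor hooks instead of these methods.
    """

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                 # row key -> {column: value}
        self.index = defaultdict(set)  # indexed value -> set of row keys

    def _unindex(self, row_key):
        # Remove the index entry that points at the row's current value.
        old = self.rows.get(row_key, {}).get(self.indexed_column)
        if old is not None:
            self.index[old].discard(row_key)
            if not self.index[old]:
                del self.index[old]

    def put(self, row_key, columns):
        # Plays the role of a write hook: update the index together with the row.
        if self.indexed_column in columns:
            self._unindex(row_key)
            self.index[columns[self.indexed_column]].add(row_key)
        self.rows[row_key] = {**self.rows.get(row_key, {}), **columns}

    def delete(self, row_key):
        # Plays the role of a delete hook: drop the index entry, then the row.
        self._unindex(row_key)
        self.rows.pop(row_key, None)

    def lookup(self, value):
        # Answer "which rows have this value?" from the index, without a full scan.
        return sorted(self.index.get(value, ()))


table = IndexedTable("city")
table.put("r1", {"city": "Beijing", "name": "a"})
table.put("r2", {"city": "Beijing"})
table.put("r1", {"city": "Shanghai"})  # update moves r1 to a new index entry
print(table.lookup("Beijing"))
print(table.lookup("Shanghai"))
```

Because every update and delete first removes the stale index entry, a lookup by column value never returns a row whose value has since changed, which is the correctness property the intercepted CRUD operations are meant to guarantee.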