Design And Implementation Of The Middleware System For Unstructured Textual Big Data

Posted on: 2016-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: C Yin
Full Text: PDF
GTID: 2348330476955325
Subject: Information and Communication Engineering
Abstract/Summary:
Big data mining has been a very active topic in recent years. According to statistics, more than 85% of the big data collected from the Internet is unstructured: some is generated automatically, some comes from major newspapers and social media, and some from a variety of social software. This diversity of sources makes the data noisy and dynamically heterogeneous. The purpose of data pre-processing is therefore to give the data structure, usually through techniques such as data cleaning, format unification, filtering, and text vectorization. However, pre-processing unstructured text is a cumbersome and time-consuming task that accounts for more than 60% of the overall workload of data mining. In the big data context in particular, large enterprises place growing emphasis on the timeliness of data mining and insist on faster computation and shorter mining cycles. A high-performance distributed pre-processing middleware would therefore make mining at industrial scale considerably more convenient.

To cope with the explosion of big data, two basic capabilities are in high demand for a data pre-processing system: first, storing and managing PB-scale unstructured text data; second, completing massive pre-processing tasks in a short time. To meet these demands, this paper studies the following aspects in the context of textual big data analysis at mobile communications companies:

(1) A framework for a distributed middleware system for pre-processing unstructured text is proposed. Based on the Hadoop architecture, the system aims to solve the low efficiency of a single computing unit.

(2) A data management system is implemented using HBase, a column-oriented distributed database. To correctly store the sharply growing volume of data that a relational database cannot handle, this paper analyzes HBase comprehensively, including its logical structure, physical structure, key-value format, and cluster optimization. Query efficiency and load balancing are also considered in the implementation.

(3) Four pre-processing algorithms are parallelized on the Spark platform to overcome the limitation that traditional programs can run only on a single machine. Because two distributed programming frameworks exist, MapReduce and Spark, this paper explains the choice of Spark by comparing the advantages and disadvantages of the two.

Finally, the system is tested in comparative experiments between a single computer and a computing cluster using multiple metrics, which demonstrate its capability to pre-process unstructured textual big data.
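As a rough illustration of the pre-processing techniques named in the abstract (data cleaning, format unification, filtering, and text vectorization), the following single-machine Python sketch chains them as per-record functions. It is not the thesis's implementation: the function names, the tag-stripping rule, the minimum-length filter, and the bag-of-words vectorizer are all assumptions chosen for illustration. Because each step works on one record at a time, steps of this shape could in principle be applied through Spark's `map` and `filter` transformations on an RDD.

```python
import re
import unicodedata
from collections import Counter


def clean(text):
    """Data cleaning (assumed rule): strip HTML-like tags, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def unify(text):
    """Format unification (assumed rule): NFKC-normalize and lowercase."""
    return unicodedata.normalize("NFKC", text).lower()


def keep(text, min_tokens=3):
    """Filtering (assumed rule): drop records with fewer than min_tokens words."""
    return len(text.split()) >= min_tokens


def vectorize(text, vocabulary):
    """Text vectorization: bag-of-words counts over a fixed vocabulary."""
    counts = Counter(re.findall(r"[a-z0-9]+", text))
    return [counts[term] for term in vocabulary]


def preprocess(records, vocabulary):
    """Run the four steps in order; each step touches one record at a time."""
    cleaned = map(clean, records)
    unified = map(unify, cleaned)
    kept = filter(keep, unified)
    return [vectorize(text, vocabulary) for text in kept]


if __name__ == "__main__":
    # Hypothetical sample records, not data from the thesis.
    sample = [
        "<p>Network  signal is   WEAK today</p>",
        "ok",
        "Signal signal dropped again",
    ]
    print(preprocess(sample, ["signal", "weak", "dropped"]))
```

Keeping every step free of shared state is what makes this kind of pipeline easy to distribute: no record depends on any other, so the work can be partitioned across cluster nodes.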
Keywords/Search Tags: Hadoop, HBase, Spark, Unstructured data, Text mining, Pre-processing