Design And Implementation Of The Middleware System For Unstructured Textual Big Data

Posted on:2016-09-18

Degree:Master

Type:Thesis

Country:China

Candidate:C Yin

Full Text:PDF

GTID:2348330476955325

Subject:Information and Communication Engineering

Abstract/Summary:

Big data mining has been a very active topic in recent years. According to statistics, more than 85% of the big data collected from the Internet is unstructured: some of it is generated automatically, some comes from major newspapers and social media, and some from a variety of social software. This diversity of sources results in noisy and dynamically heterogeneous data. The purpose of data pre-processing is therefore to give the data structure, usually through techniques such as data cleaning, format unification, filtering, and text vectorization. However, pre-processing unstructured text is a cumbersome and time-consuming task that accounts for more than 60% of the overall workload of data mining. In the big data context in particular, large enterprises place growing emphasis on the timeliness of data mining and seek to increase computation speed and shorten the mining cycle. A high-performance distributed pre-processing middleware would therefore make mining at industrial scale considerably more convenient.

To cope with the explosive growth of big data, two basic capabilities are in high demand for a data pre-processing system: first, storing and managing petabyte-scale unstructured text data; second, completing massive pre-processing tasks in a short time. To meet these demands, this thesis studies the following aspects in the context of textual big data analysis at mobile communications companies:

(1) A framework for a distributed middleware system that pre-processes unstructured text is put forward. Built on the Hadoop architecture, the system aims to solve the low efficiency of a single computing unit.

(2) A data management system is implemented using HBase, a column-oriented distributed database. To correctly store rapidly growing data that a relational database cannot handle, this thesis makes a comprehensive analysis of HBase, including its logical structure, physical structure, key-value format, and cluster optimization. Query efficiency and load balancing are also considered in implementing the HBase system.

(3) Four pre-processing algorithms are parallelized on the Spark platform to solve the problem that traditional programs can only be executed on a single machine. Because two distributed programming frameworks exist, MapReduce and Spark, this thesis explains the reason for choosing Spark by comparing the advantages and disadvantages of both frameworks.

Finally, the system designed in this thesis is tested in comparative experiments between a single computer and a computing cluster using multiple metrics, which demonstrate its capability to pre-process unstructured textual big data.
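
The pre-processing steps named in the abstract (data cleaning, format unification, filtering, and text vectorization) can be illustrated with a minimal single-machine sketch. This is not the thesis's implementation: the abstract does not specify its four algorithms, so the function names, the stop-word list, and the bag-of-words vectorization below are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def clean(text):
    """Data cleaning: drop HTML-like tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def unify(text):
    """Format unification: lower-case, keep only letters, digits and spaces."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower())


def tokenize_and_filter(text):
    """Filtering: split on whitespace, drop stop words and 1-character tokens."""
    return [t for t in text.split() if t not in STOP_WORDS and len(t) > 1]


def vectorize(tokens, vocabulary):
    """Text vectorization: term counts in vocabulary order (bag of words)."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def preprocess(records):
    """Run the four steps over a list of raw text records."""
    token_lists = [tokenize_and_filter(unify(clean(r))) for r in records]
    # The vocabulary is the only step that needs to see every record.
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    vectors = [vectorize(tokens, vocabulary) for tokens in token_lists]
    return vocabulary, vectors


if __name__ == "__main__":
    vocab, vecs = preprocess([
        "<p>The network is SLOW!</p>",
        "Slow   network, slow billing.",
    ])
    print(vocab)
    print(vecs)
```

In this sketch, cleaning, unification, and filtering each operate on one record at a time, so they could in principle be applied as per-record map operations on a cluster; only the vocabulary construction needs an aggregation across records.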

Keywords/Search Tags:

Hadoop, HBase, Spark, Unstructured data, Text mining, Pre-processing

Related items

1	Research And Implementation Of Non Structured Data Management In Discrete Manufacturing Industry Based On Hadoop
2	Research On Parallel Data Mining Based On Hadoop
3	The Research And Implementation Of Bayesian Classification Algorithm In The Text Based On Spark Platform
4	Data Mining Based On Hadoop Platform
5	The Research And Implementation On Processing Technology Of Massive Network Traffic Log Based On Hadoop
6	The Research And Application Of Storage And Mining Methods For Massive In-Vehicle Information
7	Optimal Design And Application Of User Health Information Service Platform Based On Big Data Processing
8	Research On Parallel Mining Algorithm Of Association Pattern Based On Spark
9	The Key Technologies Research Of Web Text Mining Based On Hadoop
10	Vehicle Routing Data Processing System Based On Hadoop And C4.5 Algorithm