Distributed frameworks towards building an open data architecture

Posted on: 2016-01-23

Degree: M.S.

Type: Thesis

University: University of North Texas

Candidate: Venumuddala, Ramu Reddy

Full Text: PDF

GTID: 2478390017976933

Subject: Computer Science

Abstract/Summary:

Data is everywhere. Current technological advances in digital and social media, together with the ease of access to application services that interact with a variety of systems, are generating tremendous volumes of data. Because these services are so varied, data is no longer restricted to structured formats such as text; it also includes unstructured content such as social media data, videos, and images. Generated data is of no use unless it is stored and analyzed to derive value. Traditional database systems come with limitations on data format and schema, access rates, and storage size. Hadoop is an open source distributed framework from Apache that supports storing huge datasets in different formats reliably on its file system, the Hadoop Distributed File System (HDFS), and processing the data stored on HDFS using the MapReduce programming model.

This thesis is about building a data architecture using Hadoop and its related open source distributed frameworks to support a data flow pipeline on low-cost commodity hardware. The data flow components are data sourcing, storage management on HDFS, and a data access layer. The study also discusses a use case that exercises the architecture components. Sqoop is used to ingest structured data from a database into Hadoop, and Flume is used to ingest semi-structured streaming Twitter JSON data into HDFS for analysis. The data sourced using Sqoop and Flume is analyzed using Hive for SQL-like analytics and, at a higher level of the data access layer, Hadoop is compared with Spark, an in-memory computing system. Significant differences in query execution performance between the Hadoop and Spark frameworks are analyzed. This integration helps ingest huge volumes of streaming JSON data in a variety of formats and derive better value-based analytics using Hive and Spark.
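As a rough illustration of the MapReduce programming model mentioned in the abstract, the following single-process Python sketch counts hashtags in JSON tweet records. It is not taken from the thesis: the record layout, the `hashtags` field, and all function names are assumptions made for this example, and a real Hadoop job would distribute the map and reduce steps across a cluster.

```python
import json
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (hashtag, 1) pair for each hashtag in one JSON record."""
    tweet = json.loads(line)
    # "hashtags" is a hypothetical field name chosen for this sketch.
    for tag in tweet.get("hashtags", []):
        yield tag.lower(), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Two made-up records standing in for streaming JSON data.
sample = [
    '{"text": "a", "hashtags": ["Hadoop", "Spark"]}',
    '{"text": "b", "hashtags": ["hadoop"]}',
]
counts = run_job(sample)
```

The same grouping could be written as a SQL-like `GROUP BY` query in Hive, which is the style of analysis the abstract describes for the data access layer.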

Keywords/Search Tags:

Data architecture, Distributed frameworks, Social media, Using Hive, Streaming JSON, Data access layer

Related items

1	Research On A Data-Driven Media Streaming Application Layer Multicast Strategy
2	Based On The Three Layer Architecture Of Distributed Storage System Access Technology Research
3	Research And Application Of Database Access Layer Based On JSON
4	Design And Implementation Of Agricultural Data Security Exchange Based On Json
5	Research On Technologies Of Data Scheduling And Transmission Layer Optimization In P2P Streaming Systems
6	Design And Implementation Of Contextual Marketing Based On Distributed Computing Hive And Data Mining
7	Method And Implementation For Hive-Based Offline Data Processing
8	Design And Application Of Management & Control System On Thermal Power Plants
9	The Research And Application Of Social Media Data Management System Based On Crawler Technology
10	Research On The Stability Of Media Streaming Distribution In The Next Generation Internet