The Design and Implementation of PCF Based on Hadoop

Posted on: 2015-03-10
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Gu
Full Text: PDF
GTID: 2308330461455048
Subject: Software engineering
Abstract/Summary:
eBay, a large C2C e-commerce company, has thousands of buyers and sellers conducting transactions on its website every day, generating large amounts of data. Mining the data produced by user visits can help eBay executives make decisions and thereby improve buyers' purchase rate. How to process this data, and how to make better use of it, is the main motivation for this project.
This thesis originates from eBay's need for data analysis. The order in which users click between pages of the website is an important behavior. Analyzing it helps eBay improve search results, arrange the number and placement of advertisements on pages sensibly, and support analysis for eBay sellers. eBay's buyer data is currently stored in chronological order, but analyzing the page click distribution over chronologically stored data wastes a great deal of time on reading data, so the PCF system adopts a new storage order to speed up the analysis.
The PCF system uses HDFS as its storage infrastructure and, for computation, MapReduce together with Cascading, a Hadoop-based framework. To make shared classes and methods easier to reuse, the system uses Maven for project management.
Users' page clicks on the site are recorded in a database called Sojourner. Sojourner holds very specific, detailed data organized by session, and within each session the pages are stored sequentially by the time they were opened. On mobile, the chronological order of opened pages can basically be taken as the order in which the user opened them. On the PC side, however, chronological order does not reflect the user's real page click behavior. The system therefore recombines the Sojourner data and changes its storage structure so that it can extract user behavior and derive coarse-grained information for display in the front end.
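The difference between chronological and logical click order can be illustrated with a small sketch. The abstract does not say how PCF recovers the logical order, so the record layout and the `referrer` field below are hypothetical assumptions made only for this illustration; they are not the Sojourner schema.

```python
from collections import defaultdict

# Hypothetical record layout: (session_id, page_id, open_time, referrer).
# All field names and values are assumptions for illustration only.
events = [
    ("s1", "search", 1, None),
    ("s1", "item_A", 2, "search"),
    ("s1", "item_B", 3, "search"),    # opened in a second tab from the search page
    ("s1", "checkout", 4, "item_A"),
]

def chronological_transitions(events):
    """Pair each page with the next page opened in the same session."""
    by_session = defaultdict(list)
    for session_id, page_id, open_time, _ in events:
        by_session[session_id].append((open_time, page_id))
    transitions = []
    for session_id in sorted(by_session):
        pages = [page for _, page in sorted(by_session[session_id])]
        transitions.extend(zip(pages, pages[1:]))
    return transitions

def logical_transitions(events):
    """Pair each page with the page it was actually clicked from."""
    ordered = sorted(events, key=lambda e: (e[0], e[2]))
    return [(ref, page) for _, page, _, ref in ordered if ref is not None]

chrono = chronological_transitions(events)
logical = logical_transitions(events)
```

Under these assumptions, the chronological view reports a click from `item_A` to `item_B` that never happened, while the logical view attributes both item pages to `search`. This is the kind of distortion the abstract describes for the PC side.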
The system stores the data in its logical order. Data support staff no longer need to write code to extract data from the data warehouse for analysts; they only need to keep the system running stably. Data analysts no longer need to write their own data processing code; they can directly use the complete processing methods the system provides. This reduces the workload on both sides and lowers the probability of error.
The raw data that PCF processes is stored in Sojourner tables in the Teradata data warehouse. These tables are split by date and sorted by time, which makes them unsuitable for the page jump rate statistics the system needs to compute. PCF therefore uses HDFS, which can store unstructured data, together with MapReduce and Cascading, a data processing layer built on MapReduce. This allows large amounts of data to be analyzed effectively. Because the Sojourner tables are insert-only, each run only needs to process, by time, the data inserted since the previous run, so the running time is short and manageable. Because Cascading and Hadoop are used mainly to process data, and the data flows in one direction, the Pipe and Filter pattern is well suited to the system's architecture. Sojourner data, the main input, passes through Join, Filter, and a series of other operations and is converted into the unstructured storage format required for analysis, which speeds up the subsequent statistics.
The main users of the system are data analysts. With the unstructured storage and processing, and with buyers' click data sorted in logical order, they can quickly calculate buyers' click rates from daily page hits. This makes real-time data analysis possible and supports better decision-making at eBay.
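The one-way data flow and the incremental runs over an insert-only table can be sketched with plain Python generators standing in for pipes and filters. This is a minimal illustration, not the Cascading implementation: the record fields (`time`, `from_page`, `to_page`), the watermark mechanism, and the definition of jump rate as the share of clicks leaving a page that go to each destination are all assumptions.

```python
from collections import Counter

def newer_than(records, watermark):
    """Filter: pass only records inserted after the previous run."""
    for rec in records:
        if rec["time"] > watermark:
            yield rec

def with_source_page(records):
    """Filter: drop records that have no source page (session entry pages)."""
    for rec in records:
        if rec["from_page"] is not None:
            yield rec

def to_transition(records):
    """Pipe stage: reduce each record to a (from_page, to_page) pair."""
    for rec in records:
        yield (rec["from_page"], rec["to_page"])

def run_increment(records, watermark, counts):
    """Run the pipeline on new records only; return the new watermark."""
    records = list(records)
    counts.update(to_transition(with_source_page(newer_than(records, watermark))))
    return max([watermark] + [rec["time"] for rec in records])

def jump_rates(counts):
    """Share of clicks leaving each page that go to each destination."""
    totals = Counter()
    for (src, _), n in counts.items():
        totals[src] += n
    return {(src, dst): n / totals[src] for (src, dst), n in counts.items()}

# Hypothetical insert-only table; the first run sees three rows.
table = [
    {"time": 1, "from_page": None, "to_page": "search"},
    {"time": 2, "from_page": "search", "to_page": "item"},
    {"time": 3, "from_page": "search", "to_page": "item"},
]
counts = Counter()
watermark = run_increment(table, 0, counts)

# A new row is appended; the second run processes only that row.
table.append({"time": 4, "from_page": "search", "to_page": "cart"})
watermark = run_increment(table, watermark, counts)
rates = jump_rates(counts)
```

Each stage only consumes the output of the stage before it, which is what makes the Pipe and Filter pattern a natural fit. The watermark means a run touches only rows inserted since the previous run.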
The system provides two ways to analyze the data: a simple, intuitive visual browsing page, and direct access through HDFS or Hive to read and analyze the back-end data. These serve data analysts with different technical backgrounds.
Keywords/Search Tags: Maven, HDFS, MapReduce, Cascading, Unstructured Data