Research And Application Of Data Preprocessing Technology For Video Website User Behavior Data

Posted on: 2018-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: H L Yang
Full Text: PDF
GTID: 2348330542486980
Subject: Software engineering
Abstract/Summary:
With the progress of information technology, people enjoy the convenience that Internet video and on-demand services bring. Meanwhile, video sites have accumulated a large amount of user behavior data. Analyzing these data takes a long time and consumes enormous resources, which causes great trouble for analysts. Moreover, the user behavior data of a video website is very difficult to mine within acceptable limits, because it has many dimensions and the data volume of each individual dimension is large. This situation is almost unacceptable for an enterprise. To solve this problem, combining distributed parallel computing with the idea of trading space for time is an inevitable choice. In this scenario, cube materialization, which belongs to data preprocessing, becomes the key.

In this paper, MapReduce is used in the preprocessing system to compute the data cube because of its high throughput and parallel framework. The preprocessing system trades space for time by precomputing the data cube, which reduces the time spent on analysis. For data cubes with two different types of metrics, this paper presents two computation methods. The Multi-RegionCube algorithm is designed to compute data cubes containing only algebraic metrics. An algorithm with a special bitmap structure is designed to compute data cubes containing holistic metrics.

First, the paper presents a method for encoding the real values of the source table. Then, for algebraic metrics, the Multi-RegionCube algorithm proposes a strategy of partitioning the data cube lattice into regions, to address the over-calculation and long computation time that arise in the middle levels of layer-by-layer computation. Each region is processed differently according to its characteristics, and the minimum parent dimension combination is determined by sampling to reduce the amount of computation. As for the materialization algorithm for data cubes with holistic metrics, this paper proposes using a highly compressed bitmap data structure to compute the Count Distinct measure and to reduce the memory footprint through high compressibility.

After the core materialization algorithms are introduced, experiments show that the Multi-RegionCube algorithm is effective in computing efficiency, materialization speed, and load balance. For the algorithm for data cubes containing holistic metrics, the paper designs experiments to demonstrate its computing efficiency and describes its advantage in reuse. Finally, the design and implementation of the system are presented. The system has been deployed in a production environment, and its performance is stable.
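To make the two kinds of metric concrete, the following is a minimal single-machine sketch of naive full cube materialization. It is not the Multi-RegionCube algorithm, it does not use MapReduce, and it does not use a compressed bitmap: a plain Python integer stands in for the bitmap, and every cuboid is computed directly from the source rows. The function names, the column names (`region`, `device`, `user`, `watch_seconds`), and the sample rows are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # placeholder for a dimension that has been aggregated away


def encode_column(values):
    """Dictionary-encode raw values as consecutive integer ids (0, 1, 2, ...)."""
    ids = {}
    for value in values:
        ids.setdefault(value, len(ids))
    return ids


def materialize_cube(rows, dims, sum_col, distinct_col):
    """Naively compute every cuboid; returns {cell: (sum, distinct_count)}."""
    ids = encode_column(row[distinct_col] for row in rows)
    cells = defaultdict(lambda: [0, 0])  # cell -> [running sum, bitmap]
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):  # one cuboid per dimension subset
            for row in rows:
                cell = tuple(row[d] if d in kept else ALL for d in dims)
                acc = cells[cell]
                acc[0] += row[sum_col]                 # algebraic metric: SUM
                acc[1] |= 1 << ids[row[distinct_col]]  # holistic metric: set one bit per id
    # The number of set bits in a cell's bitmap is its Count Distinct value.
    return {cell: (total, bin(bits).count("1"))
            for cell, (total, bits) in cells.items()}


# Hypothetical sample data: four viewing records.
rows = [
    {"region": "north", "device": "mobile", "user": "u1", "watch_seconds": 30},
    {"region": "north", "device": "mobile", "user": "u1", "watch_seconds": 20},
    {"region": "north", "device": "tv", "user": "u2", "watch_seconds": 50},
    {"region": "south", "device": "mobile", "user": "u1", "watch_seconds": 10},
]
cube = materialize_cube(rows, ["region", "device"], "watch_seconds", "user")
print(cube[("north", ALL)])  # total watch time and distinct users in "north"
```

In this sketch, SUM can be rolled up by adding the sums of finer cells, whereas distinct counts cannot simply be added, because the same user may appear in several cells. Keeping a bitmap per cell avoids that problem: the bitmaps of finer cells can be combined with a bitwise OR before counting the set bits.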
Keywords/Search Tags: preprocessing, data cube computation, MapReduce, distributed computing