Distributed Log Information Processing With Map-Reduce

Posted on:2012-05-16

Degree:Master

Type:Thesis

Country:China

Candidate:M Y Luo

Full Text:PDF

GTID:2178330335460562

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, e-commerce websites now routinely work with log datasets of up to a few terabytes. Removing noisy data promptly and at low cost, and extracting useful information from what remains, is a problem that has to be faced. This thesis is based on the Map-Reduce parallel processing platform. It describes the processing of log information from raw data to final model, and implements data extraction and a clustering algorithm for very large volumes of data. In the end, the users who visit a website can be clustered according to their click information. With this approach, the Hadoop cloud computing platform avoids the excessively long run times, or the failure to produce any result, that limit a single machine. At very low cost, it supports large-scale preprocessing and clustering of raw data.

We use access log information as the source data. Map-Reduce has two stages: in the map stage we extract the useful information, and in the reduce stage we perform a summation. The join operation and an improved method for it based on Map-Reduce are also studied. After this processing, we build vector space models to represent users' interests.

In particular, we focus on clustering algorithms. A clustering algorithm that integrates SOM (Self-Organizing Map) and fuzzy logic is combined with Map-Reduce. Traditional fuzzy clustering algorithms have long run times and high computational complexity. With a Hadoop cluster, large computational jobs can be accommodated simply by adding more nodes to the cluster.
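The map, reduce and vector-building steps described above can be illustrated with a minimal single-process sketch. It is not the thesis implementation and it does not run on Hadoop. The three-field log format (timestamp, user, page) and the names map_phase, reduce_phase and user_vectors are assumptions made only for this illustration.

```python
from collections import defaultdict


def map_phase(log_lines):
    """Map stage: extract (user, page) keys from raw lines, skipping malformed ones."""
    for line in log_lines:
        fields = line.split()
        if len(fields) != 3:  # assumed format: timestamp user page
            continue  # drop messy records
        _timestamp, user, page = fields
        yield (user, page), 1


def reduce_phase(pairs):
    """Reduce stage: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def user_vectors(totals):
    """Build one click-count vector per user, with one dimension per page."""
    pages = sorted({page for _user, page in totals})
    users = sorted({user for user, _page in totals})
    vectors = {u: [totals.get((u, p), 0) for p in pages] for u in users}
    return pages, vectors


# Hypothetical sample data, including one malformed line.
logs = [
    "t1 alice /books",
    "t2 alice /books",
    "t3 bob /music",
    "corrupted-line",
    "t4 alice /music",
]
totals = reduce_phase(map_phase(logs))
pages, vectors = user_vectors(totals)
```

In a real Hadoop job the map and reduce functions would run on different nodes, with the framework grouping the emitted keys between the two stages. The resulting per-user vectors are the kind of input a clustering step would consume.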

Keywords/Search Tags:

map-reduce, distributed data mining, data pre-processing, join operation

Related items

1	Research On Some Key Technologies Of Parallel Processing For Big Data Based On Map Reduce
2	The Research On Optimization Of Data Join Operation Based On GPU
3	Join Query Optimization For Large-Scale Data Based On New Computing Architecture
4	Join Processing And Optimizing On Large Clusters
5	Pretreatment Design Patent Image Retrieval Method Based On Map-Join-Reduce
6	Research On Data Skew In Join Based On Hadoop
7	Efficient Star Join For Column-Oriented Data Store In The MAP Reduce Environment
8	Result Completeness Guarantee Strategy Studies In Distributed Stream Join Systems
9	Analysis of load distribution strategies for signature search and join operation in distributed computing systems
10	Research On Optimization For Multi-way Join In A Map-Reduce Environment