
Distributed Log Information Processing With Map-Reduce

Posted on: 2012-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Luo
Full Text: PDF
GTID: 2178330335460562
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of the Internet, e-commerce websites now routinely work with log datasets of up to a few terabytes. Removing noisy data promptly and at low cost, and extracting useful information from what remains, is a problem that has to be faced.

This thesis is built on the Map-Reduce parallel processing platform. It describes the processing of log information from raw data to final model, and implements data extraction and a clustering algorithm for very large datasets. As a result, the users who visit a website can be clustered according to their click information. With this processing, the Hadoop cloud computing platform avoids the very long run times, or the failure to produce any result, that occur on a single machine. Although its cost is low, it can carry out large-scale preprocessing and clustering of raw data.

The access logs serve as the source data. A Map-Reduce job has two stages: the map stage extracts the useful information, and the reduce stage performs summation. The join operation on Map-Reduce, and a method for improving it, are also studied. After this processing, Vector Space Models are built to represent users' interests.

The thesis focuses in particular on clustering algorithms. A clustering algorithm that integrates SOM (Self-Organizing Map) and fuzzy logic is combined with Map-Reduce. Traditional fuzzy clustering algorithms run for a long time and are computationally complex. With a Hadoop cluster, large computational jobs can be accommodated simply by adding more nodes to the cluster.
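The map stage (extraction), the reduce stage (summation) and the construction of per-user vectors described in the abstract can be illustrated with a minimal single-process sketch in Python. It is not the thesis's Hadoop implementation: the log format (`user timestamp url`), the sample records, and all function names (`map_stage`, `shuffle`, `reduce_stage`, `run_job`, `to_vectors`) are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical log format "<user_id> <timestamp> <url>"; the thesis does not
# state its actual format, so these records are invented for illustration.
SAMPLE_LOG = [
    "u1 2012-05-01T10:00:00 /books",
    "u1 2012-05-01T10:01:00 /books",
    "u2 2012-05-01T10:02:00 /music",
    "malformed line",
    "u1 2012-05-01T10:03:00 /music",
]


def map_stage(line):
    """Extract ((user, page), 1) from one raw log line; drop messy records."""
    parts = line.split()
    if len(parts) != 3:
        return []  # data cleaning: skip lines that do not parse
    user, _timestamp, page = parts
    return [((user, page), 1)]


def shuffle(pairs):
    """Group mapped values by key, as the framework does between stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_stage(key, values):
    """Sum the click counts for one (user, page) key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in-process over a list of log lines."""
    mapped = [pair for line in lines for pair in map_stage(line)]
    return dict(reduce_stage(k, v) for k, v in shuffle(mapped).items())


def to_vectors(counts):
    """Build one click-count vector per user over a fixed page ordering."""
    pages = sorted({page for _, page in counts})
    users = sorted({user for user, _ in counts})
    vectors = {u: [counts.get((u, p), 0) for p in pages] for u in users}
    return pages, vectors


counts = run_job(SAMPLE_LOG)
pages, vectors = to_vectors(counts)
print(pages)
print(vectors)
```

Each vector component here is a raw click count for one page. A weighting scheme, and the SOM and fuzzy clustering step that would consume these vectors, are left out of the sketch.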
Keywords/Search Tags: map-reduce, distributed data mining, data pre-processing, join operation