Distributed Clustering Of Graph Data And Application To Data Mining For E-commerce

Posted on: 2014-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: W Jie
Full Text: PDF
GTID: 2248330395981044
Subject: Computer application technology
Abstract/Summary:
As a common data structure, a graph consists of nodes and the connections between them, and it has become a modeling tool for many kinds of complex objects and the links among them. When a customer logs into an e-commerce website and purchases goods, the related data is saved in a database. From this transaction data, a variety of customer relationship graphs can be built. Take, for example, the customer relationship formed by customers who buy the same kinds of items: each node of the graph represents a customer, and an edge indicates that two customers bought the same items on the site. Like other types of data, the customer relationship graph contains a wealth of information and knowledge, which has practical value for the customer relationship management of e-commerce sites.

Graph clustering finds clusters that are densely connected internally and loosely connected externally. It has been widely used in areas such as community discovery in social networks and the detection of protein complexes. Using graph clustering methods, different customer groups can be mined from the customer relationship graph described above. The resulting clusters may indicate that the customers share similar interests and preferences, or that they have a similar family structure, age range, and so on. This information plays an important role in personalized recommendation, in developing more targeted marketing strategies, and in improving website operations.

Some mainstream e-commerce sites, such as Taobao and One shop, have so many customers that the relationship graph they form is also very large. Because of the volume of data, a single workstation cannot meet the demand in either CPU computing power or memory, so cluster analysis cannot be executed properly.
How to mine customer group clusters from a large-scale customer relationship graph has become a common concern of the related industry.

As a parallel programming model, MapReduce is especially suitable for processing large-scale data in parallel, because it can connect hundreds or even thousands of computers and pool their resources into one large cluster of machines. Considering the advantages of MapReduce for processing large data, this thesis proposes a distributed graph clustering method that combines MapReduce with a traditional clustering method, and applies it to the discovery of customer relationships.

This thesis is based on a project named "website of the steel trade transaction data analysis". Using the transaction data of a steel trading company collected from 2006 to 2011, the thesis identifies customer groups in steel trading by graph clustering, which provides decision support for the company in developing effective marketing strategies.

Firstly, the thesis introduces the related technologies, including data mining, graph clustering, the MapReduce parallel framework, and its open-source implementation (Hadoop).

Secondly, taking a steel trade e-commerce site as an example and considering the actual characteristics of steel trade transaction data, the thesis describes how the steel trade transaction data warehouse is built and details the graph modeling of steel trade customer relationships.

Thirdly, MR-LSH, a distributed graph clustering algorithm based on the MapReduce framework, is proposed. It addresses how to use LSH to achieve scalable parallel clustering of large-scale graph data. MR-LSH combines the MapReduce parallel framework with Locality Sensitive Hashing (LSH), implementing a distributed clustering algorithm based on locality-sensitive hashing within the MapReduce parallel framework.
The thesis discusses in detail the idea behind the MR-LSH algorithm, its implementation framework, and the implementation of each step.

On this basis, using the customer relationship graph built from the transaction data of a steel trading company collected from 2006 to 2011, the thesis demonstrates the feasibility and practicality of the distributed graph clustering method for e-commerce data mining. The experimental results show that the system is safe and reliable, easy to maintain, and has good scalability.
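
The abstract does not spell out the hash family or job layout used by MR-LSH. As a rough, single-process illustration of the general idea (LSH bucketing expressed as a map step and a reduce step), the sketch below assumes MinHash with banding over each customer's set of purchased items and merges the resulting candidate pairs with union-find. The names `TRANSACTIONS`, `map_phase`, `reduce_phase`, and `cluster_customers`, the parameters, and the toy data are all hypothetical and are not taken from the thesis. An in-memory dictionary stands in for the shuffle that Hadoop would perform between the two phases.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

# Hypothetical toy transactions: (customer_id, item_id) pairs.
TRANSACTIONS = [
    ("c1", "rebar"), ("c1", "coil"), ("c1", "plate"),
    ("c2", "rebar"), ("c2", "coil"), ("c2", "plate"),
    ("c3", "pipe"), ("c3", "wire"),
    ("c4", "pipe"), ("c4", "wire"),
]

NUM_HASHES = 8  # MinHash signature length (assumed value)
BANDS = 4       # number of LSH bands; NUM_HASHES must be divisible by it


def stable_hash(seed, item):
    """Deterministic hash of (seed, item); built-in hash() is salted per run."""
    digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
    return int(digest, 16)


def minhash_signature(items):
    """One minimum per seeded hash function over the customer's item set."""
    return tuple(
        min(stable_hash(seed, item) for item in items)
        for seed in range(NUM_HASHES)
    )


def map_phase(customer, items):
    """Map: emit ((band index, band values), customer) for every band."""
    signature = minhash_signature(items)
    rows = NUM_HASHES // BANDS
    for band in range(BANDS):
        yield (band, signature[band * rows:(band + 1) * rows]), customer


def reduce_phase(customers):
    """Reduce: customers that share a bucket become candidate edges."""
    yield from combinations(sorted(set(customers)), 2)


def cluster_customers(transactions):
    # Group purchases into one item set per customer.
    baskets = defaultdict(set)
    for customer, item in transactions:
        baskets[customer].add(item)

    # Stand-in for the shuffle: group mapper output by bucket key.
    buckets = defaultdict(list)
    for customer, items in baskets.items():
        for key, value in map_phase(customer, items):
            buckets[key].append(value)

    # Merge candidate pairs into clusters with union-find.
    parent = {customer: customer for customer in baskets}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for customers in buckets.values():
        for left, right in reduce_phase(customers):
            parent[find(left)] = find(right)

    groups = defaultdict(set)
    for customer in baskets:
        groups[find(customer)].add(customer)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    print(cluster_customers(TRANSACTIONS))
```

In this toy data, customers with identical baskets receive identical signatures and land in the same buckets, while customers with disjoint baskets share no bucket. With real data, the band count and signature length trade recall against the number of spurious candidate pairs, and candidate pairs would normally be verified against an explicit similarity threshold before merging.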
Keywords/Search Tags: distributed clustering, graph clustering, data mining for E-commerce, Hadoop, MR-LSH algorithm