Research And Implementation Of Data Mining Based On Distributed Computing

Posted on: 2011-07-20 | Degree: Master | Type: Thesis
Country: China | Candidate: P Ran | Full Text: PDF
GTID: 2178330338486253 | Subject: Software engineering
Abstract/Summary:
With the rapid growth of the Internet and of the number of people online, companies that provide network services generate terabytes of Web data every day. These data record users' access behavior and carry high-value information. Analyzing and mining them can yield interesting models that help Internet companies provide better network services. Internet companies often use association rule mining to analyze users' browsing behavior and improve user stickiness, thereby improving the profitability of the site. Internet data is massive, diverse, heterogeneous and dynamically changing. Given the storage volume and processing efficiency this demands, traditional databases can no longer meet the requirements of analyzing and processing it. The emergence of distributed computing platforms addresses the storage and computation bottlenecks of massive data processing, making it possible to mine very large data sets.

The core problem of massive data mining is how to apply traditional association rule mining algorithms on a distributed computing platform. These algorithms are suitable only for analyzing and mining centralized data; faced with the distributed file system of a distributed computing platform, they fail. The improved Apriori algorithm is well adapted to Hadoop's Map/Reduce computing model. As a result, all steps of data mining (data cleaning, data conversion and the mining itself) can be carried out on the distributed computing platform. The models mined by the improved Apriori algorithm meet the requirements of the actual business logic and have high reference value for Internet companies.

The characteristic of this study is the integration of model research and business applications. It uses leading-edge distributed computing technology to overcome the shortcomings of traditional data mining solutions on massive data, and the improved data mining algorithm is well adapted to the distributed computing platform Hadoop. This can serve as a reference for applying other data mining algorithms to Hadoop in the future, and a richer set of data mining algorithms can uncover more of the value hidden in the data.
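
The abstract does not describe the improved Apriori algorithm in detail, so the following is only a minimal sketch of how one Apriori support-counting pass can be phrased in a map/reduce style. It runs in a single Python process rather than on Hadoop. The function names (`map_phase`, `reduce_phase`, `frequent_itemsets`), the sample sessions and the support threshold are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction, k):
    """Map step: emit (itemset, 1) for every k-item subset of one transaction."""
    items = sorted(set(transaction))  # de-duplicate and fix a canonical order
    for itemset in combinations(items, k):
        yield itemset, 1


def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each itemset."""
    counts = defaultdict(int)
    for itemset, n in pairs:
        counts[itemset] += n
    return dict(counts)


def frequent_itemsets(transactions, k, min_support):
    """Run one counting pass and keep itemsets seen at least min_support times."""
    pairs = (pair for t in transactions for pair in map_phase(t, k))
    counts = reduce_phase(pairs)
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical browsing sessions: each list is the set of pages one user visited.
sessions = [
    ["home", "news", "sports"],
    ["home", "news"],
    ["home", "sports"],
    ["news", "sports", "home"],
]

if __name__ == "__main__":
    # Page pairs that occur together in at least 3 sessions.
    print(frequent_itemsets(sessions, k=2, min_support=3))
```

On a real cluster the map step would run in parallel over blocks of the distributed file system and the framework would group the emitted pairs by key before the reduce step. A full Apriori run would repeat this pass for increasing k and restrict the emitted itemsets to candidates built from the previous pass; that pruning is omitted here.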
Keywords/Search Tags:Distributed computing platform, Map/Reduce, Data Mining