Distributed Association Rules Algorithm Based on Spark

Posted on: 2018-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: X H Liu
Full Text: PDF
GTID: 2348330542987335
Subject: Computer Science and Technology
Abstract/Summary:
With the progress of science and technology, the amount of information in today's society is growing explosively. These large volumes of data must be mined to find knowledge that is valuable and meaningful for social life and even for national development. Data mining is already widely used in daily life, and the concept is no longer unfamiliar. It is a process that applies mining methods to find valuable data in massive amounts of information and to extract the implicit information in large-scale data. With data mining technology, people can obtain meaningful and useful knowledge from large amounts of data and then use it to make decisions and sound judgments on important matters. For example, supermarket managers mine the commodity transaction data of a shopping mall to improve sales and service, analyze the mined information to understand customers' buying preferences, and determine a reasonable scheme for where to place goods, which can bring substantial profits to the mall. The uses of data mining are very broad: science and technology, medicine, and other fields in China all need it.

In recent years, many methods for mining association rules in databases have been proposed, including the FP-tree algorithm and its improved variants. These algorithms help people find information in very large data sets. However, large-scale data now poses many problems for the earlier association rule algorithms, such as low efficiency and heavy use of storage space. Many researchers have therefore proposed applying distributed and cloud computing technology to big data in order to mine valuable information.

To fully improve the speed of data mining, however, a large number of compute nodes must be used in a networked environment. When many servers work in the same network segment and many different tasks are transmitted at the same time, the available network bandwidth becomes limited. On both internal and external networks, transmission becomes slower and slower, which causes serious transmission delay. Therefore, this thesis uses the CSFP-tree algorithm on the primary node for data preprocessing and then passes the preprocessed data to the Spark distributed processing environment to achieve efficient data mining. Compared with Hadoop, Spark performs its computation in memory, so it is better suited to iterative algorithms.
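As a rough illustration of the ideas in the abstract, the following pure-Python sketch shows the kind of preprocessing that FP-tree style algorithms typically perform (dropping infrequent items and ordering each transaction by descending item frequency), followed by the support and confidence of one association rule. It is not the thesis's CSFP-tree algorithm and does not use Spark; the function names, the toy supermarket transactions, and the threshold `min_count=3` are all assumptions made for this example.

```python
from collections import Counter


def preprocess(transactions, min_count):
    """Drop items seen in fewer than min_count transactions and sort the
    remaining items of each transaction by descending frequency.
    (Generic FP-tree style preprocessing; an assumption, not CSFP-tree.)"""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_count}
    ordered = []
    for t in transactions:
        # Ties in frequency are broken alphabetically for a stable order.
        kept = sorted((i for i in set(t) if i in frequent),
                      key=lambda i: (-counts[i], i))
        if kept:
            ordered.append(kept)
    return counts, ordered


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def support(itemset, transactions):
    """Fraction of transactions that contain itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that
    also contain consequent."""
    both = count_containing(set(antecedent) | set(consequent), transactions)
    return both / count_containing(antecedent, transactions)


# Hypothetical supermarket baskets, invented for illustration only.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]

counts, ordered = preprocess(transactions, min_count=3)
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(ordered)
print(rule_support, rule_confidence)
```

In a distributed setting such as the one the abstract describes, the preprocessed transactions would then be handed to the cluster for mining; that step is omitted here because the source does not give its details.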
Keywords/Search Tags: Distributed data mining, Association rules, FP-tree algorithm, Spark