
Research And Implementation Of Big Data Analysis And Mining Technology Based On Hadoop In Telecommunications Industry

Posted on: 2017-12-20  Degree: Master  Type: Thesis
Country: China  Candidate: H J Cui
GTID: 2348330518995646  Subject: Software engineering
Abstract/Summary:
With the development of information technology, the volume of generated data is expanding rapidly, and data mining technology has developed in response. Big data brings both challenges and opportunities: extracting useful information from such a large amount of data is a challenging task. The communications industry holds a large amount of customer data, and using big data technology to analyze it and uncover latent knowledge in order to improve the service experience is a meaningful task. The work done in this thesis in that context is as follows.

First, in terms of algorithms, this thesis uses a clustering algorithm for customer segmentation and a decision tree algorithm for customer prediction. The traditional K-means algorithm requires the number of clusters as input, but with such a large amount of data the data distribution is not known in advance, which makes the algorithm difficult to apply. To solve this problem, the thesis improves the K-means implementation and proposes a DGK-means algorithm, which uses a genetic algorithm to calculate the most appropriate number of clusters and a density-based idea to calculate the fitness function, improving the efficiency and accuracy of the algorithm. The C4.5 decision tree algorithm builds a decision tree model from the test data set and uses that model to predict unknown results, in support of customer prediction and customer retention.

Second, to meet the needs of big data analysis and mining, this thesis designs and implements a Hadoop-based big data analysis and mining system for the communications industry, which uses HDFS for distributed data storage and MapReduce for parallel computing. Both the clustering algorithm and the decision tree algorithm were designed to run in parallel. Distributed data storage on the Hadoop platform makes parallel computing convenient, and the parallel design of the decision tree algorithm also reduces pruning, improving the efficiency and accuracy of the algorithm.

Finally, the performance of the system and the algorithms was validated on test data sets. The results show that the DGK-means algorithm improves on the traditional algorithm in both accuracy and efficiency. Under parallel computing, efficiency improves when the cluster has more than 2 nodes, and the improvement becomes more pronounced as the number of nodes increases.
Keywords/Search Tags: data mining, Hadoop, K-means, decision tree algorithms, parallel computing
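
To illustrate how a clustering algorithm can be split into map and reduce phases, as the abstract describes for the MapReduce-based design, the following is a minimal sketch of plain K-means written in that style. It is not the thesis's DGK-means: the number of clusters is fixed by the initial centroids, the genetic and density-based steps are omitted, and the shuffle phase is simulated in-process with a dictionary rather than run on Hadoop. All function names (`map_assign`, `reduce_mean`, `kmeans_iteration`, `kmeans`) are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step: average all points that share one centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One K-means pass; a dict stands in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid with no assigned points keeps its old position.
    return [reduce_mean(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


def kmeans(points, centroids, max_iter=100):
    """Repeat map/reduce passes until the centroids stop changing."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

On a real cluster, each mapper would process one HDFS block of points and each reducer would receive all points for one centroid index. Because the map step is independent per point, this is the part that benefits from adding nodes.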