
The Design And Implementation Of Parallel Conditional Random Fields Algorithm Based On Spark Platform

Posted on: 2018-06-06    Degree: Master    Type: Thesis
Country: China    Candidate: Z R Gong    Full Text: PDF
GTID: 2348330542461658    Subject: Computer technology
Abstract/Summary:
With the rapid development of Internet technology, data from all walks of life is growing explosively, and the era of big data has arrived. How to effectively analyze and mine this huge volume of data has become a central concern. Large amounts of irregular data must be processed efficiently, and the traditional stand-alone mode can no longer meet this demand. The emergence of distributed technology has greatly advanced the analysis and mining of big data. Spark and Hadoop are the two most widely used distributed parallel computing frameworks. This paper adopts Spark as the distributed big data processing platform, because Spark not only retains the advantages of Hadoop MapReduce but also provides in-memory computing, DAG scheduling optimization, and a richer set of operators. For highly iterative and complex machine learning algorithms, Spark therefore offers a clear advantage.

Conditional Random Fields (CRFs) are a probabilistic graphical model that can integrate a variety of features and capture the correspondence between an observation sequence and its label sequence. CRFs have been widely applied in many areas of Natural Language Processing (NLP). However, with the rapid development of the Internet and information systems, the traditional conditional random field encounters efficiency problems when dealing with large data. Existing parallel CRF implementations are mainly based on MPI, GPU, and Hadoop, but they do not consider the bottlenecks of network bandwidth and disk I/O in a distributed cloud environment. Based on the distributed in-memory computing framework Spark, this paper proposes an optimized parallel conditional random field model (SCRFs). The main work includes the following aspects:

(1) Based on the Spark platform, the three stages of the conditional random field, namely feature generation, parameter training, and model prediction, are parallelized, which improves the time efficiency of the model. Because CRF training requires many iterations, and each iteration must transform the training data into intermediate results, these intermediate results are cached in memory to avoid repeated computation and thereby reduce the overall computation time.

(2) The model is improved in two respects. First, it is observed that during each iteration of parameter training the network cost is high because the model's feature vectors are very large; feature hashing is therefore used to reduce the feature dimension, retaining most of the information of the original features at minimal cost. Second, the L-BFGS algorithm must compute the gradient over all of the data to update the parameters, which is inefficient on large datasets; a Batch-SGD method is used instead to achieve faster iterative computation on large data. A sketch of this driver-side pattern is given after the abstract.

(3) The improved conditional random field model, SCRFs, is implemented on Spark. A Spark cluster is used as the experimental environment, and the classification accuracy, recall, F1 value, time performance, and speedup ratio of the model are evaluated on this cluster. The results show that the improved model has clear advantages in processing large data on the Spark platform.
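The pattern described in points (1) and (2) can be illustrated with a minimal, self-contained sketch in Scala using the Spark RDD API. This is not the thesis code: the raw feature names, hash size, learning rate, and sampling fraction are illustrative assumptions, and the linear-chain CRF gradient (forward-backward) is replaced by a simple logistic-loss gradient as a stand-in so that the driver-side structure stays short. What the sketch does show are the three ideas from the abstract: caching the transformed training data in memory across iterations, hashing raw features into a fixed-dimensional space, and updating parameters with mini-batch (Batch-SGD) gradients aggregated across the cluster.

// Minimal sketch (Scala, Spark RDD API); illustrative only, not the thesis implementation.
// It shows: (a) caching transformed training data in memory across iterations,
// (b) feature hashing into a fixed-dimensional space, and (c) Batch-SGD updates
// computed with treeAggregate. The CRF forward-backward gradient is replaced by a
// logistic-loss gradient as a stand-in; feature names, sizes, and rates are assumed.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.util.hashing.MurmurHash3

object ScrfsSketch {
  val numFeatures = 1 << 16                              // hashed feature space (65,536 buckets)

  // Feature hashing: map raw string features of one example to fixed-size indices.
  def hashFeatures(raw: Seq[String]): Array[Int] =
    raw.map(f => (MurmurHash3.stringHash(f) & Int.MaxValue) % numFeatures).toArray

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("scrfs-sketch").setMaster("local[*]"))

    // Toy training data: (raw feature strings, binary label). In the thesis setting the
    // input would be labeled token sequences produced by the feature-generation step.
    val data: RDD[(Array[Int], Double)] = sc.parallelize(Seq(
      (Seq("w=Spark", "suffix=rk"), 1.0),
      (Seq("w=China", "suffix=na"), 0.0)
    )).map { case (raw, y) => (hashFeatures(raw), y) }
      .cache()                                           // keep intermediate results in memory

    var w = new Array[Double](numFeatures)               // model parameters
    val eta = 0.1                                        // learning rate (assumed)

    for (iter <- 1 to 20) {
      val wB = sc.broadcast(w)                           // ship current weights once per iteration
      // Batch-SGD: sample a fraction of the cached data instead of using all of it.
      val batch = data.sample(withReplacement = false, fraction = 0.5, seed = iter)
      val (gradSum, count) = batch.treeAggregate((new Array[Double](numFeatures), 0L))(
        seqOp = { case ((g, c), (x, y)) =>
          // Stand-in gradient: logistic loss on hashed features, not the CRF gradient.
          val p = 1.0 / (1.0 + math.exp(-x.map(wB.value(_)).sum))
          x.foreach(i => g(i) += (p - y))
          (g, c + 1L)
        },
        combOp = { case ((g1, c1), (g2, c2)) =>
          var i = 0; while (i < g1.length) { g1(i) += g2(i); i += 1 }
          (g1, c1 + c2)
        }
      )
      if (count > 0) {
        var i = 0
        while (i < w.length) { w(i) -= eta * gradSum(i) / count; i += 1 }
      }
      wB.destroy()
    }
    sc.stop()
  }
}

In the actual SCRFs model the per-example gradient would come from the forward-backward computation over each labeled sequence, but the surrounding structure, namely the cached RDD, the broadcast weights, the sampled mini-batch, the tree-aggregated gradient, and the driver-side update, is the part that improvements (1) and (2) of the abstract describe.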
Keywords/Search Tags: Conditional Random Fields, Big Data, Hadoop, Spark, Machine Learning