Research On Parallel Processing Technology Of Large-scale Text Mining Under Cloud Computing Environment

Posted on: 2018-12-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Ai
GTID: 1318330542483712
Subject: Computer Science and Technology
Abstract/Summary:
With the arrival of big data, data volumes are growing exponentially and have reached the TB, PB, and even ZB scale, and a large part of this data is text. This rapidly growing body of text holds great potential value. Text mining allows people to discover knowledge and regularities in text data and to derive value from them. Traditional text mining methods cannot process large-scale text data effectively and, in particular, cannot meet users' timeliness requirements. The emergence and development of cloud computing offers a way to process large-scale text data quickly and efficiently. To make full use of the parallel computing power and dynamic resource allocation of cloud computing, parallel processing technology for large-scale text mining in cloud computing environments has become very important. In view of this, this dissertation aims to improve the efficiency of text mining while preserving mining accuracy. Focusing on text clustering and text named entity recognition, it studies parallel processing technology for large-scale text mining in cloud computing environments from four aspects: algorithm parallelization strategy, co-design of algorithm parallelism and hardware, parallel algorithm design for a specific application, and efficient management of parallel resources. The contributions are as follows.

First, regarding algorithm parallelization strategy, the conditional random field (CRF) model for text named entity recognition suffers from long parameter estimation cycles and poor time efficiency when dealing with large-scale text data. To solve this problem, a MapReduce-based parallel CRF algorithm called MRCRF is proposed on the Hadoop platform. MRCRF handles the time-consuming steps of the CRF model by parallelizing the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) and Viterbi algorithms, which yields the MRLB and MRVtb algorithms respectively. The MRLB algorithm leverages the MapReduce framework to enhance parameter estimation, and the MRVtb algorithm infers the most likely state sequence by extending the Viterbi algorithm with another MapReduce job. MRCRF partitions a large dataset effectively to achieve optimal resource utilization and minimize the need for replication. Experimental results show that MRCRF is significantly more time-efficient than the traditional CRF model while preserving a guaranteed level of correctness.

Second, regarding the co-design of algorithm parallelism and hardware, a distributed parallel CRF called DHCRF is proposed to further improve the performance of the CRF model for text named entity recognition in a big data environment. DHCRF is built on the GFlink platform for heterogeneous CPU-GPU clusters. It uses a three-stage heterogeneous Map and Reduce operation to improve performance, making full use of CPU-GPU collaborative computing. DHCRF is further optimized by combining elastic data partitioning with intermediate result multiplexing: elastic data partitioning keeps the load balanced, and intermediate result multiplexing reduces data communication. Experimental results show that DHCRF notably outperforms both the baseline CRF algorithm and the existing parallel CRF algorithm while maintaining competitive correctness.

Third, regarding parallel algorithm design for a specific application, an effective method for detecting microblog hot topics in a big data environment has been lacking. To solve this problem, a parallel two-phase mic-mac hot topic detection (TMHTD) method is proposed and implemented in the Apache Spark cloud computing environment. To improve the accuracy of hot topic detection, three optimization methods are proposed along with TMHTD. To handle large databases, a group of MapReduce jobs is deliberately designed to carry out hot topic detection in a highly scalable way. Extensive experimental results indicate that TMHTD significantly improves accuracy and performance over existing approaches.

Fourth, regarding efficient management of parallel resources, the uncertainty in parallelizing text mining causes the computing tasks to change, which requires frequent changes to the cluster resource configuration. Elastic cloud computing platforms have not yet solved the resource allocation problem for users in terms of ease of use. Because the elasticity of a cloud platform determines the amount of parallel computing resources a user requires, and users can evaluate an elastic cloud computing platform through elasticity measurements, this dissertation presents a new definition of elasticity measurement and proposes a quantification and measurement model based on the characteristics of the text dataset. The model makes it easy to calculate the elasticity value of a cloud computing platform precisely, and it can predict the number of parallel resources and other performance indicators from the number of text datasets. It thus gives non-professional users guidance on platform selection and resource allocation and enables efficient management of parallel resources. The numerical results demonstrate the basic parameters affecting elasticity as measured by the proposed approach. Furthermore, simulation and experimental results validate that the proposed measurement model is correct and effective and can be used as a general model for cloud platform elasticity measurement.
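
As an illustration of the decoding step named in the first contribution, the following is a minimal sketch of standard Viterbi decoding over log-scores, plus a map-style helper that decodes each sequence of a partition independently. It is not the MRVtb algorithm itself, whose details the abstract does not give. The names `viterbi`, `decode_partition`, `emit`, and `trans` are hypothetical, and the sketch assumes non-empty sequences with a shared transition table.

```python
from typing import List, Sequence


def viterbi(emit: Sequence[Sequence[float]],
            trans: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring state sequence for one input sequence.

    emit[t][s]  : log-score of state s at position t (assumed non-empty)
    trans[p][s] : log-score of moving from state p to state s
    """
    n_states = len(trans)
    score = list(emit[0])          # best score of any path ending in each state
    back: List[List[int]] = []     # back[t-1][s] = best predecessor of s at t

    for t in range(1, len(emit)):
        prev = score
        score = []
        pointers = []
        for s in range(n_states):
            # Pick the predecessor that maximizes the path score into s.
            best_p = max(range(n_states), key=lambda p: prev[p] + trans[p][s])
            pointers.append(best_p)
            score.append(prev[best_p] + trans[best_p][s] + emit[t][s])
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def decode_partition(sequences, trans) -> List[List[int]]:
    """Hypothetical map step: sequences are decoded independently,
    so each data partition can be processed by a separate worker."""
    return [viterbi(emit, trans) for emit in sequences]
```

Because each sequence is decoded without reference to any other, a function like `decode_partition` could be applied to every data partition in parallel, which is the property a map-style job relies on.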
Keywords/Search Tags:Big data, Cloud Computing, Text mining, Parallel Processing, Text named entity recognition, Text clustering, Parallel resource efficient management