Research On The Parallelization Of Text Categorization Based On Convolution Neural Network

Posted on: 2019-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: A Y Liang
GTID: 2428330545976034
Subject: Computer application technology
Abstract/Summary:
With the spread of the Internet and the rapid development of computer technology, the volume of online information has grown explosively, and most of it exists as text. As data volumes keep increasing, automatically classifying massive, disorganized, and non-standardized text according to given rules within a limited time has become a pressing problem in natural language processing, both domestically and internationally. Online text today is characterized by strong timeliness, large volume, sparse features, and non-standard expression, while existing classification algorithms running in stand-alone mode have two main deficiencies: long running time and low accuracy. To address these two deficiencies, and building on existing classification algorithms, this thesis takes news text as its research object and studies association rule algorithms, the convolutional neural network (CNN) algorithm, and the Spark distributed computing platform. The three main tasks of the thesis are as follows.

(1) A Spark-based optimized association rule algorithm is proposed and applied to text mining. To address sparse text features and insufficient semantic expression, the thesis studies association rule algorithms and proposes Apriori_MC, an optimized association rule algorithm that is combined with the Spark distributed platform to run in parallel and improve computational efficiency. Comparison with several existing association rule algorithms verifies the feasibility and extensibility of Apriori_MC, and the algorithm is then used to mine correlations in the text corpus.

(2) A parallel convolutional neural network model based on the Spark platform is designed. This part first describes the traditional CNN algorithm, covering its structure, activation functions, and parameter optimization. It then presents CNN_SP, a method for training a CNN model in parallel on Spark, aimed at the long training time of the traditional CNN algorithm on large datasets. The method follows a divide-and-conquer approach: the training samples are split into uniform data blocks, and the blocks are distributed to the worker nodes of the Spark cluster. Each worker node holds a complete copy of the CNN model and trains it on its own block. The arithmetic mean of the resulting intermediate results is then taken as the new weight values, which are broadcast to every node to update the model. Training stops when the number of iterations reaches its upper limit or the network converges to within a given threshold. Experiments verify that the CNN model is feasible on a Spark cluster, where it runs faster and reduces the memory consumption incurred in a single-machine environment.

(3) A two-input convolutional neural network model that combines feature extension with sentence component extraction is designed and parallelized. To address the problems of the convolutional neural network used in (2), the EE_CNN model is proposed. The model first analyzes the pre-processed samples and extracts the key components of each sentence (such as subject, predicate, and object). The features of the samples are then extended using the correlations and similarity relationships obtained in (1). Finally, the model is combined with parallelization techniques to improve its efficiency. Experiments show that the accuracy of the EE_CNN model is about 3% higher than that of the SVM algorithm and about 2% higher than that of the traditional CNN model. When the EE_CNN algorithm is run on the Spark cluster, its efficiency also improves significantly.

In summary, based on theoretical research into text classification and parallelization technology, this thesis focuses on association rule algorithms and on the optimized implementation of the CNN algorithm on the Spark platform. The experimental results show that combining the two algorithms addresses the problems of sparse text features, low classification accuracy, and slow execution in the current big data environment, and that it improves, to a certain extent, both the accuracy and the execution efficiency of classification on large-scale text datasets.
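The training loop described in task (2) (split the samples into blocks, train a full model copy per worker, average the results, broadcast the averaged weights, stop on an iteration limit or a convergence threshold) can be illustrated with a minimal sketch. This is not the thesis's CNN_SP implementation: a small linear model stands in for the CNN, a plain Python loop stands in for the Spark worker nodes, and every function name, parameter, and default value below is a hypothetical choice made for illustration.

```python
import random


def split_into_blocks(samples, num_workers):
    """Split samples into num_workers near-equal blocks (round-robin)."""
    return [samples[i::num_workers] for i in range(num_workers)]


def local_train(weights, block, lr=0.05):
    """One worker: a single SGD pass over its block, starting from the
    broadcast weights. A linear model y ~ w.x stands in for the CNN."""
    w = list(weights)
    for x, y in block:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def average_weights(all_weights):
    """Arithmetic mean of the workers' weight vectors, element by element."""
    n = len(all_weights)
    return [sum(column) / n for column in zip(*all_weights)]


def train_parameter_averaging(samples, dim, num_workers=4,
                              max_iters=200, tol=1e-6):
    """Repeat: local training on every block, then average and 'broadcast'.
    Stops at max_iters or when the largest weight change falls below tol."""
    blocks = split_into_blocks(samples, num_workers)
    weights = [0.0] * dim
    iterations = 0
    for iterations in range(1, max_iters + 1):
        # On a cluster this step would run in parallel, one block per worker.
        local_results = [local_train(weights, block) for block in blocks]
        new_weights = average_weights(local_results)
        change = max(abs(a - b) for a, b in zip(new_weights, weights))
        weights = new_weights  # the averaged weights become the shared model
        if change < tol:
            break
    return weights, iterations


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic, noise-free data generated from y = 2*x0 - 3*x1.
    data = []
    for _ in range(200):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 2 * x[0] - 3 * x[1]))
    learned, used = train_parameter_averaging(data, dim=2)
    print(learned, used)
```

Averaging once per pass keeps communication low but lets the worker copies drift apart between synchronizations; how often CNN_SP synchronizes is not stated in the abstract, so the once-per-pass schedule here is only an assumption.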
Keywords/Search Tags: Text categorization, Association rules algorithm, Feature extension, Convolutional Neural Network, Distributed computing