
Research And Implementation Of Text Classification For Chinese Push Messages

Posted on: 2020-12-22
Degree: Master
Type: Thesis
Country: China
Candidate: J M Cai
GTID: 2428330602452233
Subject: Engineering
Abstract/Summary:
In recent years, with the rapid progress of communication technology, smartphones have gained more functions and more users. The mobile Internet industry is developing at high speed; it has been widely integrated into public life and generates large numbers of user push messages. These messages reflect the development of the related industries, but they are too complicated to manage easily. Filtering and organizing them efficiently to extract their potential value is an urgent problem.

This thesis studies the automatic classification of Chinese push messages and improves the classification algorithm by taking the characteristics of the text data into account. First, it surveys text preprocessing techniques and selects a suitable word segmentation technique to segment the text of a company's mobile push messages. Next, it selects text features with the chi-square test, and the dimension-reduced texts are converted into sparse vectors. Four text similarity measures are then compared within the kNN algorithm; based on the experimental results, cosine similarity is selected for finding the nearest neighbors during classification.

The thesis then analyzes the advantages and disadvantages of common classification algorithms, including kNN and decision trees. Because kNN is computationally expensive and slow, an improved kNN algorithm combined with a decision tree, called TREE-kNN, is presented. A CART decision tree is used to preclassify the text data, and classification performance is evaluated on each leaf node of the tree. For samples that fall into a leaf node with a low evaluation, the comparison scope is reduced to the training subset that covers only the text categories they belong to, and the improved kNN algorithm is then used to classify them. Dividing the sample space of the training set in this way reduces both the number of training samples each unclassified sample is compared against and the number of cosine similarity computations. To address the problem that the speedup becomes less noticeable as k grows, the thesis introduces the Rocchio algorithm into the classification procedure. Experiments show that TREE-kNN is clearly more efficient than the traditional kNN algorithm, and that classification accuracy also increases.

Finally, based on these classification methods, the thesis designs and implements a text mining system for large-scale data that computes the quantity distribution of push messages and visualizes the statistics. On the Spark platform, text feature selection and text vectorization are implemented in parallel. Text segmentation and text classification are split across multiple data partitions for parallel execution, which improves task execution efficiency. After classification, Spark is used to compute the time distribution of the number of trade messages and the spatial distribution of logistics messages. The statistical results are stored in a database. Using Web technology, queries over the statistical data are encapsulated in Dubbo services. The system's controller module sends requests to the data query service and passes the returned data to the front end, which renders it as a Web page chart with ECharts. In this way, the spatial and temporal distribution of push messages can be displayed clearly.
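As an illustration only, the two baseline building blocks named in the abstract (chi-square feature scoring and kNN with cosine similarity over sparse vectors) can be sketched in plain Python. This is not the thesis's implementation: the function names, the dictionary-based sparse representation, and the toy data are assumptions made for the example, and the TREE-kNN preclassification and Rocchio steps are not shown.

```python
import math
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for one term/category pair (2x2 contingency table).

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0  # degenerate table: term or category is constant
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query, training, k=3):
    """Majority vote among the k training samples most similar to the query.

    training is a list of (sparse_vector, label) pairs.
    """
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two "trade" and two "logistics" messages.
training = [
    ({"order": 1.0, "paid": 1.0}, "trade"),
    ({"refund": 1.0, "order": 1.0}, "trade"),
    ({"parcel": 1.0, "shipped": 1.0}, "logistics"),
    ({"courier": 1.0, "parcel": 1.0}, "logistics"),
]
predicted = knn_classify({"parcel": 1.0, "courier": 1.0}, training, k=3)
```

Every query here is compared against the whole training set, which is the cost the abstract attributes to traditional kNN; restricting `training` to the subset of categories suggested by a preclassifier is the kind of reduction TREE-kNN is described as making.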
Keywords/Search Tags: Text classification, Spark, Decision tree, kNN, Web system