Design And Implementation Of Uyghur Text Classifier Based On Generalized Information Entropy

Posted on: 2018-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Tang
Full Text: PDF
GTID: 2348330542450086
Subject: Engineering
Abstract/Summary:
With the advent of the big data era, the amount of Uyghur-language information on the Internet has increased rapidly, so Uyghur text classification technology is key to handling this large volume of text data. The most characteristic difficulty of automatic Uyghur text classification is the high dimensionality of the feature space and the high time cost of classification. Existing Uyghur text classification research usually takes the word as the smallest independent semantic carrier, that is, as the classification feature. If the feature space is not processed, it may include every term in the articles, and the total number of Uyghur terms is almost impossible to measure accurately, so the high-dimensional feature space imposes an enormous computational cost on every classification algorithm. Finding an effective feature selection method that conforms to the distribution of Uyghur words, yields features with strong discriminating ability, and reduces the dimension of the feature space, thereby reducing the time complexity of the classification algorithm and improving classification accuracy, has become the primary problem of automatic Uyghur text classification. A fast and accurate Uyghur text classification algorithm is therefore an urgent need in this field.

To address these problems, this thesis takes robustness to interference in text classification into account and, on the premise of effectively improving the real-time performance of the algorithm while ensuring classification accuracy, designs and implements a decision tree algorithm based on generalized information entropy and a genetic algorithm: the GIE-FDT (Generalized Information Entropy Fused with Decision Tree) algorithm. A Uyghur text classifier is implemented with this algorithm.

To meet the real-time requirement of the text classifier, the thesis reduces the time cost of the algorithm in three ways:
1) Using generalized information entropy as the text classification feature greatly reduces the dimension of the feature space and effectively reduces the time cost of the algorithm.
2) The algorithm unifies feature calculation and model training, which avoids computing the text features and then processing them a second time for model training, and so greatly reduces the time cost.
3) Provided the training corpus does not change significantly and is reasonably representative, the model trained the first time is general, and subsequent classification does not require retraining, which satisfies the real-time requirement of text classification.

On top of the real-time requirement, the accuracy of the classification algorithm must also be ensured. To improve accuracy, generalized information entropy is used to take the differences between classes into comprehensive consideration, which allows more accurate classification. A genetic algorithm is used at the same time to keep noise from interfering with the results and to improve the anti-interference capability of the proposed algorithm. However, the genetic algorithm has the drawback of long training time, so a stochastic gradient descent algorithm is introduced to accelerate model training.

Based on the anti-interference, real-time, and accuracy performance of the GIE-FDT algorithm, the following conclusions are obtained:
1) By using generalized information entropy as the text classification feature, the GIE-FDT algorithm effectively reduces the dimension of the feature space and the computational cost. Feature calculation and model training are integrated into a single process, which meets the real-time requirement.
2) The GIE-FDT algorithm trains the model with a genetic algorithm, which avoids falling into local optima and obtains an approximately optimal result, satisfying the accuracy requirement.
3) The GIE-FDT algorithm uses the genetic algorithm to apply dynamic feedback correction to the model parameters, which avoids interference from noise in the training set and meets the anti-interference requirement.
4) Provided the training corpus has no major changes and is reasonably representative, the GIE-FDT algorithm does not need to retrain the model on later use; the model from the first training can be used directly, which greatly reduces the time cost and suits more application scenarios.
5) Implementing a stochastic gradient descent algorithm accelerates model training and greatly reduces the training time of the model.

A large number of experiments show that the GIE-FDT algorithm designed and implemented in this thesis meets the real-time requirement and also has high accuracy and anti-interference capability. The research in this thesis has some theoretical value and reference significance for similar work.
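The abstract does not give the definition of generalized information entropy, so the following is only a minimal sketch of the general idea of entropy-based feature selection, using standard Shannon entropy and information gain as a stand-in rather than the thesis's own measure. The function names (`entropy`, `information_gain`, `select_features`) and the representation of each document as a set of tokens are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Reduction in class entropy when documents are split on containing `term`.

    `docs` is a list of token sets (an assumed representation),
    `labels` is the parallel list of class labels.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


def select_features(docs, labels, k):
    """Keep the k terms with the highest information gain (ties keep alphabetical order)."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(
        vocabulary,
        key=lambda t: information_gain(docs, labels, t),
        reverse=True,
    )
    return ranked[:k]
```

Ranking terms this way and keeping only the top `k` is one common way to shrink a word-level feature space before training a decision tree; the thesis's measure may weight class differences differently.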
Keywords/Search Tags: decision tree, generalized information entropy, Uyghur language, text classification, genetic algorithm, stochastic gradient descent algorithm