
Study on Feature Selection Methods for Uyghur Text Classification

Posted on: 2018-12-24 | Degree: Master | Type: Thesis
Country: China | Candidate: J B Han | Full Text: PDF
GTID: 2348330533456490 | Subject: Control Science and Engineering
Abstract/Summary:
With the Fifth Plenary Session in 2015, China elevated big data to a national strategy, and "Big Data Opens the Smart Era" has become a trend of the times. The big data era was created by the Internet, whose rapid development has led to an explosion of data. Data on this scale brings both opportunities and challenges: much valuable information is buried under useless data, making it difficult for people to find what they need, so it is important to extract valuable information from large volumes of data. Text classification is a core task in data mining, and feature selection plays an indispensable role in text classification, so feature selection methods merit study. Xinjiang is a multi-ethnic region where the Uyghur language is widely used, and the development of the Internet has brought the Uyghur language both opportunities and challenges. Uyghur text classification technology is therefore important, and this thesis studies and analyzes feature selection methods for the Uyghur language. The main contents and results are as follows:

(1) Based on an analysis of the deficiencies of traditional information gain and of the complex characteristics of the Uyghur language, an improved information gain method for Uyghur feature selection is proposed. The traditional information gain is modified in four respects. First, it is modified by combining word frequency, a feature distribution coefficient, and inverse document frequency. Then, an alternative feature distribution coefficient is introduced to balance the number of features selected across classes. Experiments show that the improved information gain algorithm is effective: the selected features are distributed relatively uniformly across classes and discriminate clearly between them, which overcomes the shortcomings of traditional information gain.

(2) Considering the advantages and disadvantages of traditional information gain and the chi-square statistic, a hybrid improved Uyghur feature selection method is proposed. First, the method evaluates feature items category by category. Second, the evaluation values are normalized so that they are neither too large nor too small, which eases analysis. Then, word frequency is introduced to compensate for the fact that neither algorithm takes word frequency into account. Finally, two regulatory factors are introduced so that the relative proportion of the two algorithms can be adjusted easily, making the method more practical. Results on the Uyghur data set show that the hybrid feature selection method effectively improves the classification metrics and is robust.

(3) Building on these feature selection methods, the classical particle swarm optimization (PSO) algorithm is studied, improved according to the characteristics of the Uyghur language, and applied to Uyghur feature selection. First, on the basis of the third chapter, traditional information gain (IG) is used for a rough pre-selection of features, which reduces the feature dimension. Second, a quadratic function is introduced into the update of the inertia weight, so that the weight varies over the course of the particle search and both exploitation and exploration are fully reflected. Then, the number of selected features is introduced into the fitness function, which further reduces the feature dimension. Finally, the improved particle swarm algorithm is evaluated on the Uyghur sample set. The results show that the feature subsets obtained after IG rough selection and improved PSO discriminate clearly between classes, which improves the classification performance on Uyghur text to a certain extent.
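For reference, the traditional information gain that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the standard document-presence formulation, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), and not the improved method of the thesis. The function names, the representation of documents as token sets, and the toy tokens and labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Traditional information gain of `term` with respect to the class labels.

    docs   : list of sets of tokens, one set per document (hypothetical format)
    labels : list of class labels, aligned with docs
    term   : the candidate feature
    """
    n = len(docs)
    classes = sorted(set(labels))

    # Class counts overall, among documents containing the term,
    # and among documents not containing it.
    all_counts = Counter(labels)
    with_term = Counter(lab for d, lab in zip(docs, labels) if term in d)
    without_term = Counter(lab for d, lab in zip(docs, labels) if term not in d)

    n_with = sum(with_term.values())
    n_without = n - n_with

    h_c = entropy([all_counts[c] for c in classes])
    h_with = entropy([with_term[c] for c in classes])
    h_without = entropy([without_term[c] for c in classes])

    # Reduction in class entropy from knowing whether the term is present.
    return h_c - (n_with / n) * h_with - (n_without / n) * h_without


# Toy corpus with placeholder tokens and labels (not taken from the thesis).
docs = [{"a", "x"}, {"a", "y"}, {"b", "x"}, {"b", "y"}]
labels = ["sport", "sport", "politics", "politics"]

ig_a = information_gain(docs, labels, "a")  # "a" occurs only in "sport" documents
ig_x = information_gain(docs, labels, "x")  # "x" occurs equally in both classes
```

Note that this score depends only on whether a term is present in a document, not on how often it occurs. That is the kind of gap the abstract describes filling with word frequency.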
Keywords/Search Tags: Uyghur, Feature selection, Information gain, Chi-square statistics, Particle swarm optimization