Text Classification Based On Machine Learning

Posted on: 2019-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: Z Wang
Full Text: PDF
GTID: 2428330566495918
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of Internet technology, the volume of text data on the Internet has grown exponentially. Traditional methods of classifying text can no longer cope with the current data volume, and automatic text classification has become a research hotspot. Text classification is an important branch of text mining that can effectively meet the need for automatic classification in the big-data era. Feature selection and the classification algorithm are the two key parts of text classification, and this thesis focuses on both.

In the feature selection part, this thesis proposes a hybrid feature selection method (CHMI) based on the chi-square statistic (CHI) and mutual information (MI). The method first introduces word frequency to address the chi-square statistic's sensitivity to low-frequency words. It then uses adjustment parameters to improve the mutual information method's sensitivity to categories. Finally, the two improved methods are combined into a hybrid feature selection method that handles both low-frequency words and categories well. Experimental results show that, compared with the traditional chi-square statistic and mutual information methods, this method effectively improves text classification accuracy with support vector machine, naive Bayes, and K-nearest-neighbor classifiers.

In the classification algorithm part, the support vector machine is chosen as the classifier. The core of the support vector machine is its kernel function. This thesis proposes a mixed kernel function based on the polynomial kernel and the Gaussian kernel, which combines the advantages of both. It retains the polynomial kernel's ability to extract global features, exploits the Gaussian kernel's stronger local learning ability, and compensates for the relatively weak interpolation ability of the polynomial kernel. This kernel function makes the classification results more balanced across categories. This thesis also proposes a multi-class support vector machine algorithm that uses cosine similarity. Within the one-vs-one strategy that support vector machines use for multi-class problems, the algorithm uses cosine similarity to measure how similar the text to be classified is to each category. It then drops the classifiers for unrelated categories, thereby reducing computational complexity and improving classification accuracy. Simulation experiments confirm the feasibility and superiority of the algorithm.
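The abstract does not give the formula for the mixed kernel, so the following is only a minimal sketch under one common assumption: a convex combination of a polynomial kernel and a Gaussian kernel. The names `mixed_kernel`, `weight`, `degree`, `coef0`, and `gamma`, and all default values, are hypothetical and do not come from the thesis. A convex combination of two valid kernels is itself a valid kernel, which is why this form is a natural guess.

```python
import math


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree, a global kernel."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||x - y||^2), a local kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mixed_kernel(x, y, weight=0.5, degree=2, coef0=1.0, gamma=0.5):
    """Assumed form: weight * polynomial + (1 - weight) * Gaussian.

    `weight` in [0, 1] is a hypothetical mixing parameter; the thesis
    may combine the two kernels differently.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    global_part = poly_kernel(x, y, degree, coef0)
    local_part = rbf_kernel(x, y, gamma)
    return weight * global_part + (1.0 - weight) * local_part


def gram_matrix(samples, **kernel_params):
    """Pairwise mixed-kernel values for a list of feature vectors."""
    return [[mixed_kernel(a, b, **kernel_params) for b in samples]
            for a in samples]
```

A Gram matrix built this way could be passed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. Setting `weight` to 1 or 0 recovers the pure polynomial or pure Gaussian kernel, which makes it easy to compare the mixed kernel against either baseline.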
Keywords/Search Tags: Text classification, Feature selection, Support vector machine, Kernel function, Cosine similarity