Text Classification Based On Machine Learning

Posted on: 2019-08-23

Degree: Master

Type: Thesis

Country: China

Candidate: Z Wang

Full Text: PDF

GTID: 2428330566495918

Subject: Signal and Information Processing

Abstract/Summary:

With the rapid development of Internet technology, the volume of text data on the Internet has grown exponentially. Traditional text classification methods can no longer cope with the current data volume, and automatic text classification has become a research hotspot. Text classification is an important branch of text mining that can effectively meet the need for automatic classification in the era of big data. Feature selection and the classification algorithm are its two key parts, and this thesis focuses on both.

For feature selection, this thesis proposes a hybrid method (CHMI) based on the chi-square statistic (CHI) and mutual information (MI). The method first introduces word frequency to address the chi-square statistic's sensitivity to low-frequency words, then uses adjustment parameters to improve the sensitivity of the mutual information method to categories. Finally, the two improved methods are combined into a hybrid feature selection method that performs well with respect to both low-frequency words and categories. Experimental results show that, compared with the traditional chi-square statistic and mutual information methods, this method effectively improves text classification accuracy with support vector machine, naive Bayes, and K-nearest-neighbor classifiers.

For the classification algorithm, the support vector machine is chosen as the classifier. The core of the support vector machine is its kernel function. This thesis proposes a mixed kernel function based on the polynomial kernel and the Gaussian kernel, which combines the advantages of both. It retains the polynomial kernel's ability to capture global features, exploits the Gaussian kernel's stronger local learning ability, and compensates for the polynomial kernel's limitations in interpolation. This kernel function makes the classification results more balanced across categories. This thesis also proposes a multi-class support vector machine algorithm that uses cosine similarity. Within the one-vs-one strategy that support vector machines use for multi-class problems, the algorithm uses cosine similarity to measure how similar the text to be classified is to each category. It skips the classifiers for unrelated categories, thereby reducing computational complexity and improving classification accuracy. Simulation experiments also demonstrate the feasibility and superiority of the algorithm.
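
The abstract does not give the exact form of the mixed kernel. A minimal sketch of one common construction, a convex combination of a polynomial kernel and a Gaussian kernel, might look like the following. The function names, the weight parameter, and all default values (degree, coef0, gamma) are assumptions made for illustration and are not taken from the thesis.

```python
import math


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree, a global kernel."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2), a local kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mixed_kernel(x, y, weight=0.5, degree=2, coef0=1.0, gamma=0.5):
    """Assumed form: weight * polynomial + (1 - weight) * Gaussian.

    A convex combination of two valid kernels is itself a valid kernel,
    so the result can be used wherever a single kernel is expected.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    poly = polynomial_kernel(x, y, degree=degree, coef0=coef0)
    rbf = gaussian_kernel(x, y, gamma=gamma)
    return weight * poly + (1.0 - weight) * rbf


def gram_matrix(samples, kernel):
    """Pairwise kernel values for a list of feature vectors."""
    return [[kernel(a, b) for b in samples] for a in samples]


if __name__ == "__main__":
    # Toy feature vectors (hypothetical, e.g. term weights of three documents).
    docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in gram_matrix(docs, mixed_kernel):
        print(row)
```

Under these assumptions, a weight close to 1 emphasizes the global polynomial term and a weight close to 0 emphasizes the local Gaussian term. The resulting Gram matrix could be passed to any SVM implementation that accepts a precomputed kernel.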

Keywords/Search Tags:

Text classification, Feature selection, Support vector machine, Kernel function, Cosine similarity

Related items

1	Research On Text Emotion Classification Based On Improved Feature Selection Method
2	Research Of Automatic Text Classification Method Based On Machine Learning
3	Research On Text Classification Algorithm Based On Support Vector Machine And Neural Network
4	Research On Text Classification Based On Support Vector Machine With Mixture Of Kernels
5	The Selection And Improvement Of Support Vector Machine Kernels
6	The Research And Application Of Automatic Text Classifier Based On Support Vector Machine
7	Research On Text Classification Of Mixed-kernel Parallel Support Vector Machine Based On Hadoop
8	Research On Text Classification Based-on Support Vector Machine
9	Research On Method And Application Of Fuzzy Support Vector Machine With Feature Selection
10	Research On Chinese Text Categorization Based On Support Vector Machine