Text Categorization Algorithm Based On Machine Learning

Posted on: 2020-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Zhu
Full Text: PDF
GTID: 2428330590459402
Subject: Computer technology
Abstract/Summary:
As an important way to process documents, text categorization plays a key role in information processing, news classification, public opinion monitoring and automatic document classification. In recent decades, the theory and methods of machine learning have been improved and enriched, and applying them to text classification has produced a large body of research results. However, in the age of big data, text data is characterized by large volume, disorder and an uneven distribution of topics, so improving the accuracy of text classification remains a challenge. Text categorization involves feature selection, text representation and classifier model construction, and the algorithms commonly used for each have shortcomings. This paper therefore studies text categorization algorithms from these three aspects.

(1) The term frequency feature selection algorithm does not consider the correlation between feature items and categories when extracting feature items. To address this, this paper proposes a text classification algorithm based on hybrid features of word similarity and term frequency. The algorithm calculates the similarity between every entry in the texts of each category and the entries in the corresponding category feature list. An entry is retained if its similarity value exceeds the preset similarity threshold and discarded otherwise. After the similarity values have been calculated for the entries of all texts in the set, feature subsets strongly correlated with the categories are extracted by term frequency, and feature items that have a great influence on classification are eliminated. Experimental results verify the effectiveness of the improved algorithm.

(2) The traditional vector space model (VSM) has excessively high dimensionality, produces very sparse vector representations and cannot represent the semantics of documents. To address this, this paper proposes an improved vector space model based on TF-IDF and Word2vec. The model computes Word2vec word vectors for all words in the preprocessed text set and calculates TF-IDF weights for the feature items extracted from each document. Finally, the whole text is represented as a vector by combining the feature item weights with the word vectors. Experiments confirm the validity of the improved model.

(3) The decision hyperplane generated by the SoftMax regression model is linear, whereas text categorization is non-linear in nature, so using this model to categorize text limits accuracy. This paper builds a non-linear SoftMax regression text categorization algorithm by extending the category attribute Xn with square terms and multiplication terms. This turns the linear hypothesis in the decision function into a non-linear hypothesis, which yields a non-linear model and thus a non-linear separating hyperplane that improves the accuracy of text categorization. Experimental results demonstrate the effectiveness of the improved algorithm.
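
The representation described in (2) can be illustrated with a minimal sketch. The abstract does not state how the TF-IDF weights and word vectors are combined, so the weighted sum used here is an assumption, as are the function names `tfidf_weights` and `document_vector`. The two-dimensional word vectors are hand-made toy values standing in for trained Word2vec vectors.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def document_vector(term_weights, word_vectors, dim):
    """Sum the word vectors of a document, each scaled by its tf-idf weight.

    Assumption: a plain weighted sum; terms without a vector are skipped.
    """
    vec = [0.0] * dim
    for term, weight in term_weights.items():
        word_vec = word_vectors.get(term)
        if word_vec is None:
            continue
        for i in range(dim):
            vec[i] += weight * word_vec[i]
    return vec


# Toy corpus and toy 2-D vectors (hypothetical, not trained Word2vec output).
docs = [
    ["stock", "market", "news"],
    ["football", "match", "news"],
]
word_vectors = {
    "stock": [1.0, 0.0],
    "market": [0.8, 0.2],
    "football": [0.0, 1.0],
    "match": [0.1, 0.9],
    "news": [0.5, 0.5],
}

weights = tfidf_weights(docs)
doc_vectors = [document_vector(w, word_vectors, 2) for w in weights]
print(doc_vectors)
```

Each document becomes a dense vector whose length equals the word vector dimension rather than the vocabulary size. In this toy corpus the word "news" appears in every document, so its TF-IDF weight is zero and it contributes nothing to either vector.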
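
The extension described in (3) can be sketched as follows, assuming that "square term and multiplication term" means adding the squares and pairwise products of the input attributes before applying an ordinary SoftMax model. The function names and the hand-set weights are hypothetical and serve only to show that a model that is linear in the expanded features can separate data that is not linearly separable in the original ones.

```python
import math
from itertools import combinations


def expand_features(x):
    """Append the square terms and pairwise product terms to the attributes."""
    squares = [v * v for v in x]
    products = [a * b for a, b in combinations(x, 2)]
    return list(x) + squares + products


def softmax(scores):
    """Numerically stable SoftMax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, class_weights, biases):
    """Return (predicted class index, class probabilities) for one sample."""
    z = expand_features(x)
    scores = [
        sum(w * v for w, v in zip(row, z)) + b
        for row, b in zip(class_weights, biases)
    ]
    probs = softmax(scores)
    return probs.index(max(probs)), probs


# XOR-style data: class 0 when the signs agree, class 1 when they differ.
points = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]

# Hand-set (not learned) weights over [x1, x2, x1^2, x2^2, x1*x2]:
# only the product term x1*x2 is used.
class_weights = [
    [0.0, 0.0, 0.0, 0.0, 1.0],   # class 0 scores high when x1*x2 > 0
    [0.0, 0.0, 0.0, 0.0, -1.0],  # class 1 scores high when x1*x2 < 0
]
biases = [0.0, 0.0]

predictions = [predict(p, class_weights, biases)[0] for p in points]
print(predictions)
```

No straight line separates these four points in the original two attributes, but the product term makes the classes separable. In practice the weights would be learned by gradient descent on the SoftMax loss rather than set by hand.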
Keywords/Search Tags: Text Categorization, Frequency Feature Selection, Word Similarity, Word2vec Model, SoftMax Regression