Research On Text Classification Technology Based On Machine Learning

Posted on: 2020-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Xing
GTID: 2428330590453153
Subject: Computer technology
Abstract/Summary:
With the rapid development of information technology, network data and resources are growing rapidly, and how to classify, manage, and use this information effectively has become an active research topic. As an important foundation of information retrieval and data mining, text classification is widely used in content filtering, natural language processing and understanding, and news classification. Machine-learning-based text classification, which rests on statistical theory, first applies an algorithm to known training data to learn a rule and then uses that rule to predict and analyze unseen data.

This thesis studies text classification based on machine learning. It first introduces the general text classification process: text preprocessing, text representation, dimensionality reduction of the feature space, classification methods, and classification performance evaluation. On this basis, the random forest algorithm is selected for in-depth study; its problems and room for optimization are analyzed, and two optimizations are proposed. First, traditional random forest voting ignores the difference between strong and weak classifiers, so the voting mechanism is optimized: each decision tree is assigned a weight based on its classification performance, and this weight is combined with the class probabilities that the tree outputs for a sample to perform a weighted probability vote. Second, the hyperparameter values of a random forest strongly influence its performance, and in text classification the hyperparameters are numerous and their value ranges are wide. A hyperparameter optimization algorithm that combines random search and grid search is proposed to address this problem.

A Python-based text classification experiment is designed for the proposed algorithm. The random forest with the optimized voting mechanism is compared with the traditional random forest, and the experiment verifies the effectiveness of the random forest with optimized voting and hyperparameter selection. The random forest optimization algorithm proposed in this thesis can improve classification performance, and the hyperparameter optimization algorithm offers a useful reference for hyperparameter optimization in machine learning algorithms, especially in text classification.
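The weighted probability vote described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so this example assumes each tree's weight is a single non-negative score (for instance its validation accuracy) and that the final score of a class is the weight-normalized sum of the trees' class probabilities. The function names `weighted_probability_vote` and `majority_vote` and the sample numbers are hypothetical.

```python
from collections import Counter


def weighted_probability_vote(tree_probas, tree_weights):
    """Combine per-tree class probabilities for ONE sample.

    tree_probas[t][c]  -- probability that tree t assigns to class c
    tree_weights[t]    -- weight of tree t (assumed here: validation accuracy)
    Returns (winning class index, normalized class scores).
    """
    if len(tree_probas) != len(tree_weights):
        raise ValueError("need exactly one weight per tree")
    n_classes = len(tree_probas[0])
    scores = [0.0] * n_classes
    for probs, weight in zip(tree_probas, tree_weights):
        for cls, p in enumerate(probs):
            scores[cls] += weight * p
    total_weight = sum(tree_weights)
    scores = [s / total_weight for s in scores]
    winner = max(range(n_classes), key=scores.__getitem__)
    return winner, scores


def majority_vote(tree_probas):
    """Traditional vote: every tree casts one equal vote for its top class."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in tree_probas]
    return Counter(votes).most_common(1)[0][0]


# Two weak trees lean slightly towards class 0; one strong tree is
# confident in class 1. All numbers are made up for illustration.
probas = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]
weights = [0.6, 0.6, 0.9]

print(majority_vote(probas))                          # class 0 wins 2 votes to 1
print(weighted_probability_vote(probas, weights)[0])  # class 1 wins
```

In this toy case the equal-vote scheme follows the two weak trees, while the weighted probability vote lets the stronger, more confident tree decide, which is the behavior the optimized voting mechanism aims for.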
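One plausible way to combine random search with grid search is a coarse-to-fine scheme: random search samples the wide hyperparameter space to locate a promising region, and grid search then examines that region exhaustively. The abstract does not state how the thesis combines the two, so the sketch below is an assumption and not the author's algorithm. The helper `random_then_grid_search`, the parameters `n_random` and `grid_radius`, and the toy objective are hypothetical; in a real experiment `score_fn` would return something like the cross-validated accuracy of a random forest trained with the given hyperparameters.

```python
import itertools
import random


def random_then_grid_search(score_fn, bounds, n_random=30, grid_radius=2, seed=0):
    """Two-stage search over integer hyperparameters (higher score is better).

    bounds -- dict mapping a hyperparameter name to an inclusive (low, high) range
    """
    rng = random.Random(seed)
    names = list(bounds)

    # Stage 1: random search samples the whole space coarsely.
    best_params, best_score = None, float("-inf")
    for _ in range(n_random):
        params = {name: rng.randint(*bounds[name]) for name in names}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score

    # Stage 2: grid search over a small window around the best random point,
    # clipped to the allowed range of each hyperparameter.
    axes = []
    for name in names:
        low, high = bounds[name]
        centre = best_params[name]
        axes.append(range(max(low, centre - grid_radius),
                          min(high, centre + grid_radius) + 1))
    for values in itertools.product(*axes):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score

    return best_params, best_score


def toy_score(params):
    """Stand-in objective with its maximum (0) at n_estimators=120, max_depth=12."""
    return -((params["n_estimators"] - 120) ** 2) - (params["max_depth"] - 12) ** 2


search_space = {"n_estimators": (10, 300), "max_depth": (2, 40)}
best, score = random_then_grid_search(toy_score, search_space)
print(best, score)
```

The second stage evaluates at most `(2 * grid_radius + 1)` values per hyperparameter, so its cost stays small even when the full ranges are wide, which is the situation the abstract describes for text classification.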
Keywords/Search Tags: machine learning, text classification, random forest, weighting mechanism, hyperparameter optimization