Completing News Classification By Related Machine Learning Algorithms

Posted on: 2019-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: M M Wu
Full Text: PDF
GTID: 2428330563993058
Subject: Applied Statistics
Abstract/Summary:
With the rapid development of the internet, "machine learning" has become a popular term among scholars and major commercial companies alike. Both students who have not yet graduated and IT engineers with decades of experience have taken up machine learning, and improvements to machine learning methods continue to be proposed. Chinese text classification is an important part of this field and remains a problem that requires further study. With the explosive growth of information on the internet, how to sort texts precisely, so that information retrieval becomes easier and users' needs are better met, is a goal that still calls for sustained work. This thesis uses news data in nine categories collected from major news outlets, and it describes and discusses the technologies related to text classification in detail. First, the thesis introduces the text classification process and the significance of the topic. Second, it covers data preprocessing at length, including noise removal, text segmentation, the vector space model, feature representation, feature reduction and other related techniques. Word frequency and inverse document frequency are introduced into the traditional chi-square statistic, and the resulting improved statistic is applied to feature selection. Third, the thesis introduces three classifiers: support vector machine, Naive Bayes and random forest. Finally, a complete classification system for Chinese news is designed. The data set is split into 70% for training and 30% for testing. Training uses the Naive Bayes, support vector machine and random forest algorithms. Feature selection uses both the traditional chi-square statistic and the improved chi-square statistic, which gives six experimental groups. For evaluation, the thesis uses AUC and accuracy (with 10-fold cross-validation). The comprehensive analysis shows that, to a certain extent, the improved chi-square statistic improves the performance of the support vector machine and Naive Bayes classifiers. In addition, random forest performs best. The reasons for these results are analyzed at the end of the chapter. The thesis also points out its own shortcomings and looks ahead to future work on Chinese text classification.
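The abstract does not state the formulas it uses. As a minimal sketch of traditional chi-square feature selection only, the snippet below scores each term against each category with the standard document-count form of the statistic, N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), and keeps the highest-scoring terms. The function names, the toy English corpus and the choice to take the maximum score over categories are illustrative assumptions, not the thesis's implementation, and the improved variant that incorporates word frequency and inverse document frequency is not reproduced here because its definition is not given.

```python
# Hypothetical toy corpus: each document is already segmented into tokens.
docs = [
    ["stock", "market", "rises"],
    ["market", "falls", "today"],
    ["team", "wins", "match"],
    ["team", "loses", "match", "today"],
]
labels = ["finance", "finance", "sports", "sports"]


def chi_square(docs, labels, term, category):
    """Traditional chi-square statistic between a term and a category."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1  # term present, document in the category
        elif has_term:
            b += 1  # term present, document in another category
        elif in_category:
            c += 1  # term absent, document in the category
        else:
            d += 1  # term absent, document in another category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # term or category is degenerate; treat as uninformative
    return n * (a * d - c * b) ** 2 / denominator


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score over all categories."""
    vocabulary = sorted({token for tokens in docs for token in tokens})
    categories = sorted(set(labels))
    scores = {
        term: max(chi_square(docs, labels, term, cat) for cat in categories)
        for term in vocabulary
    }
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(vocabulary, key=lambda term: (-scores[term], term))[:k]


if __name__ == "__main__":
    print(select_features(docs, labels, 3))
```

In this toy corpus a term such as "today", which appears equally often in both categories, scores zero and is dropped, while terms confined to one category are retained. The selected terms would then define the dimensions of the vector space model passed to the classifiers.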
Keywords/Search Tags:Text Classification, Vector Space Model, Support Vector Machine, Naive Bayes, Random Forest