Completing News Classification By Related Machine Learning Algorithms

Posted on:2019-09-18

Degree:Master

Type:Thesis

Country:China

Candidate:M M Wu

GTID:2428330563993058

Subject:Applied Statistics

Abstract/Summary:

With the rapid development of the internet, “Machine Learning” has become a popular term among scholars and major commercial companies alike. Interest has grown to the point where both students who have not yet graduated and IT engineers with decades of experience have taken up machine learning, and improvements to machine learning methods have also been proposed. Chinese text classification is an important part of this field and remains a topic that needs further study. With the explosive growth of information on the internet, how to organize texts precisely, so that information retrieval becomes easier and users' needs are better met, is a goal that still requires sustained effort.

This thesis uses news data covering nine categories, collected from major news outlets, and describes and discusses the technologies related to text classification in detail. First, it introduces the text classification process and the significance of the topic. Second, it covers data preprocessing at length, including noise removal, text segmentation, the vector space model, feature representation, feature reduction and other related techniques. Term frequency and inverse document frequency are introduced into the traditional Chi-square statistic, and the resulting improved statistic is applied to feature selection. Third, it introduces three classifiers: support vector machine, Naive Bayes and random forest. Finally, a complete classification system for Chinese news is designed. The data set is split into 70% for training and 30% for testing. Naive Bayes, support vector machine and random forest are used for training, and both the Chi-square statistic and the improved Chi-square statistic are used for feature selection, giving six experimental groups. For evaluation, the thesis uses AUC and accuracy (10-fold cross-validation). The analysis shows that the improved Chi-square statistic improves the performance of support vector machine and Naive Bayes to a certain extent, and that random forest performs best overall. The reasons for these results are analyzed at the end of the chapter. The thesis also points out its own shortcomings and looks ahead to future work on Chinese text classification.
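As a rough illustration of the feature selection step mentioned above, the following Python sketch scores terms with the traditional Chi-square statistic, chi2(t, c) = N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D)), where A, B, C and D count documents by whether they contain the term and whether they belong to the category. It is a minimal sketch, not the thesis's implementation: the function names, the toy English documents and the tie-breaking rule are assumptions made for illustration, and the improved statistic (which also uses term frequency and inverse document frequency) is not reproduced because the abstract does not give its formula.

```python
def chi_square(term, category, docs):
    """Traditional chi-square score of `term` for `category`.

    docs is a list of (set_of_tokens, label) pairs (hypothetical layout).
    """
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1  # in category, contains term
        elif has_term:
            b += 1  # other category, contains term
        elif in_category:
            c += 1  # in category, lacks term
        else:
            d += 1  # other category, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate in this sample
    return n * (a * d - b * c) ** 2 / denom


def top_terms(docs, category, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    vocab = {t for tokens, _ in docs for t in tokens}
    scored = sorted(vocab, key=lambda t: (-chi_square(t, category, docs), t))
    return scored[:k]


# Toy corpus (made up for illustration; the thesis uses segmented Chinese news).
docs = [
    ({"match", "goal", "team"}, "sports"),
    ({"goal", "coach", "team"}, "sports"),
    ({"stock", "market", "team"}, "finance"),
    ({"stock", "bank", "market"}, "finance"),
]

if __name__ == "__main__":
    for term in top_terms(docs, "sports", 3):
        print(term, chi_square(term, "sports", docs))
```

Note that the statistic is symmetric: a term that appears only outside a category scores as highly as one that appears only inside it, because both are strongly associated with the category label.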

Keywords/Search Tags:

Text Classification, Vector Space Model, Support Vector Machine, Naive Bayes, Random Forest

Related items

1	Machine Learning Based On Structural And Spectral Features And Applications
2	Research And Improvement Of Automatic Text Classification Algorithm Based On The Vector Space Model
3	Research On Classification Method Of Random Support Vector Machine And Its Application
4	Automatic Classification Of Chinese Patents
5	Text Classification Method Based On Unsupervised Clustering And Naive Bayesian Classifier
6	Research And Implementation Of Text Classification Technology Based On Bayesian Theory
7	Chinese Text Data Classification
8	Design And Implementation Of Text Classification System For Online Quality Safety Information
9	Massive Text Classification Parallelization Technology Based On Support Vector Machine
10	Research On Text Classification In The Application Of Customer Complaint Prediction Of Operator