Application Of Text Categorization Algorithm In Practical Modeling

Posted on:2018-10-30

Degree:Master

Type:Thesis

Country:China

Candidate:B W Tian

Full Text:PDF

GTID:2428330620453551

Subject:Applied statistics

Abstract/Summary:

With the advent of the mobile Internet era, content innovation and resource aggregation have become increasingly rich, and both the volume and the variety of data keep growing. In China, the daily upload and playback demand carried by the major video sites has grown geometrically, which makes moderating those sites increasingly difficult. How to control video content and create a better network environment for a huge user base, while complying with the relevant national laws and regulations and without dampening users' initiative, has always been one of the most important issues for video sites. In recent years the number of video sites has been growing rapidly. At this scale of data, manual review and traditional models can no longer solve the problem. Based on data from a video website, this thesis uses data mining techniques to build a mathematical model that automatically identifies vulgar videos. Manual review is highly accurate but time-consuming; traditional methods can handle small data sets, but their accuracy is not high. The purpose of the model is to respond quickly on massive data while reaching a higher level of accuracy, that is, to strike a balance between time and accuracy. Text classification methods are applied: video titles and video tags are first segmented and filtered, and the text is then represented with a text representation model such as the Boolean model, the probability model, or the vector space model, so that unstructured text is converted into structured data that can be modeled. Next, the thesis applies the Naive Bayes method to the probability model data and logistic regression to the vector space model data, which yields better results. Finally, so that the model can respond quickly on massive data, the thesis applies feature selection, using an improved chi-square test method and a random forest-based machine learning feature selection method. The results show that feature selection effectively removes noise from the text and greatly improves the response speed of the classification model.
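
The pipeline described in the abstract (Boolean text representation, chi-square feature selection, Naive Bayes classification) can be illustrated with a minimal sketch. It is not the thesis's implementation. The toy titles, labels, and all function names below are hypothetical. The titles are assumed to be already segmented into tokens. The standard chi-square statistic is used, not the improved variant the thesis mentions, whose details the abstract does not give. The classifier is a Bernoulli Naive Bayes with Laplace smoothing, chosen here only because it pairs naturally with Boolean features.

```python
import math
from collections import Counter

# Hypothetical toy data: pre-segmented video titles; 1 = vulgar, 0 = normal.
DOCS = [
    (["hot", "dance", "girl"], 1),
    (["hot", "girl", "live"], 1),
    (["girl", "dance", "clip"], 1),
    (["cooking", "noodle", "recipe"], 0),
    (["news", "daily", "live"], 0),
    (["cooking", "daily", "clip"], 0),
]


def chi_square(term, docs):
    """Standard chi-square statistic of the 2x2 table (term present?) x (class)."""
    a = sum(1 for toks, y in docs if term in toks and y == 1)
    b = sum(1 for toks, y in docs if term in toks and y == 0)
    c = sum(1 for toks, y in docs if term not in toks and y == 1)
    d = sum(1 for toks, y in docs if term not in toks and y == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, k):
    """Keep the k terms with the highest chi-square score (ties broken by name)."""
    vocab = sorted({t for toks, _ in docs for t in toks})
    ranked = sorted(vocab, key=lambda t: (-chi_square(t, docs), t))
    return ranked[:k]


def train_bernoulli_nb(docs, features):
    """Bernoulli Naive Bayes over Boolean features, with Laplace smoothing."""
    class_counts = Counter(y for _, y in docs)
    model = {"prior": {}, "cond": {}, "features": features}
    for y, n_y in class_counts.items():
        model["prior"][y] = math.log(n_y / len(docs))
        model["cond"][y] = {
            f: (sum(1 for toks, yy in docs if yy == y and f in toks) + 1) / (n_y + 2)
            for f in features
        }
    return model


def predict(model, tokens):
    """Return the class with the highest log-posterior for a tokenized title."""
    present = set(tokens)
    scores = {}
    for y, prior in model["prior"].items():
        score = prior
        for f in model["features"]:
            p = model["cond"][y][f]
            # Boolean model: each selected feature is either present or absent.
            score += math.log(p if f in present else 1.0 - p)
        scores[y] = score
    return max(scores, key=scores.get)


features = select_features(DOCS, k=4)
model = train_bernoulli_nb(DOCS, features)
print(features)
print(predict(model, ["hot", "girl", "show"]))
```

Because only the k selected terms are scored at prediction time, the cost per title depends on k and not on the full vocabulary, which is the kind of speed-up from feature selection that the abstract reports.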

Keywords/Search Tags:

data mining, text classification, feature selection, vector space model, Random Forest

Related items

1	Research On Feature Selection And Classification Method Based On Random Forest For Medical Datasets
2	Research On Optimization Of Random Forest Algorithm And Its Application In Text Parallel Classification
3	Research On Random Forest Algorithm Based On Feature Selection And Diversity
4	Research And Application In Text Classification Based On Random Forest
5	Research On Imbalanced Data Classification Method Based On Random Forest Algorithm
6	Research On Text Classification Of Web Data Mining
7	Completing News Classification By Related Machine Learning Algorithms
8	On Research For Chinese Automatic Text Categorization Technology Based On VSM Model And Feature Selection
9	Research Of Text Categorization Base On Vector Space Model And Association Rules
10	Study On The Application Of Random Forests In Text Classification