
Application Of Text Categorization Algorithm In Practical Modeling

Posted on: 2018-10-30 | Degree: Master | Type: Thesis
Country: China | Candidate: B W Tian | Full Text: PDF
GTID: 2428330620453551 | Subject: Applied statistics
Abstract/Summary:
With the advent of the mobile Internet era, content innovation and resource aggregation have become increasingly rich, and both the volume and the variety of data have grown accordingly. In China, the daily upload and playback demand handled by the major video sites has grown geometrically, which makes content control on these sites increasingly difficult. How to control video content and create a better online environment for a huge user base, while complying with the relevant national laws and regulations and without dampening user initiative, has always been one of the most important issues for video sites. In recent years, the number of video sites has grown rapidly, and at this data scale neither manual review nor traditional models can solve the problem: manual review is accurate but time-consuming, while traditional methods can handle small data sets but with low accuracy.

Based on data from a video website, this paper uses data mining techniques to build a mathematical model that automatically identifies vulgar videos. The goal is a model that responds quickly on massive data while reaching a higher level of accuracy, that is, a balance between time and accuracy.

The paper applies text classification methods. Video titles and video tags are first segmented and filtered, and the resulting text is then represented with a text representation model, such as the Boolean model, the probability model, or the vector space model, so that unstructured data such as text is converted into structured data that can be modeled. Next, the paper uses the Naive Bayes method on the probability model data and logistic regression on the vector space model data, and obtains better results. To make the model respond quickly on massive data, the paper then applies feature selection, using in turn an improved chi-square test and a random-forest-based machine learning feature selection method. The results show that feature selection effectively removes noise from the text and greatly improves the response speed of the classification model.
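
The steps named in the abstract (Boolean text representation, chi-square feature selection, Naive Bayes classification) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the thesis's implementation: the toy titles and labels are invented, the chi-square score is the standard 2x2 statistic rather than the "improved" variant the abstract mentions, and the Bernoulli form of Naive Bayes with Laplace smoothing is an assumed choice. All function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-segmented titles/tags; label 1 = flagged, 0 = normal.
DOCS = [
    (["explicit", "clip", "leak"], 1),
    (["explicit", "video", "hot"], 1),
    (["leak", "hot", "clip"], 1),
    (["cooking", "video", "recipe"], 0),
    (["travel", "vlog", "video"], 0),
    (["recipe", "clip", "tutorial"], 0),
]

def chi_square(term, docs):
    """Standard chi-square statistic of the 2x2 table (term present/absent x label)."""
    a = sum(1 for toks, y in docs if term in toks and y == 1)
    b = sum(1 for toks, y in docs if term in toks and y == 0)
    c = sum(1 for toks, y in docs if term not in toks and y == 1)
    d = sum(1 for toks, y in docs if term not in toks and y == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(docs, k):
    """Keep the k terms with the highest chi-square score (ties broken alphabetically)."""
    vocab = sorted({t for toks, _ in docs for t in toks})
    ranked = sorted(vocab, key=lambda t: (-chi_square(t, docs), t))
    return ranked[:k]

def to_boolean(tokens, features):
    """Boolean model: 1 if the feature term occurs in the text, else 0."""
    present = set(tokens)
    return [1 if f in present else 0 for f in features]

def train_bernoulli_nb(docs, features):
    """Estimate log priors and per-feature presence probabilities for each class."""
    class_counts = Counter(y for _, y in docs)
    model = {}
    for y, n_y in class_counts.items():
        rows = [to_boolean(toks, features) for toks, lab in docs if lab == y]
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        probs = [(sum(col) + 1) / (n_y + 2) for col in zip(*rows)]
        model[y] = (math.log(n_y / len(docs)), probs)
    return model

def predict(model, features, tokens):
    """Return the class with the highest posterior log score."""
    x = to_boolean(tokens, features)

    def score(y):
        log_prior, probs = model[y]
        return log_prior + sum(
            math.log(p if xi else 1 - p) for xi, p in zip(x, probs)
        )

    return max(model, key=score)

features = select_features(DOCS, k=4)
model = train_bernoulli_nb(DOCS, features)
print(features)
print(predict(model, features, ["hot", "leak", "compilation"]))
```

Selecting features before training shrinks every document vector to `k` entries, which is the mechanism behind the response-speed gain the abstract reports. The logistic regression and random forest steps would slot in at the same points (classifier and feature ranking, respectively).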
Keywords/Search Tags: data mining, text classification, feature selection, vector space model, random forest