Text Classification And R Language Implementation Based On Vector Space Model

Posted on: 2019-07-02
Degree: Master
Type: Thesis
Country: China
Candidate: B W Jiang
Full Text: PDF
GTID: 2438330548455968
Subject: Applied statistics
Abstract/Summary:
With the rapid development of information technology and the spread of the Internet, text information has accumulated rapidly. Text classification can effectively address the resulting information disorder, which is of great practical significance for the efficient management and use of information. This thesis introduces the whole process of text classification and some implementation methods, including text data acquisition, feature extraction, feature selection, construction of the word-weight document matrix, and classification of the text set. The vector space model (VSM) transforms unstructured text data into structured data that a computer can process, and it is currently the standard approach to text processing. K-nearest neighbors (KNN) is a nonparametric classification method: it does not assume that the data follow a particular distribution, and it tolerates a certain amount of noise. It offers high accuracy, a clear concept, and easy implementation for classifying data whose distribution is unknown or non-normal, and it is a widely used classification algorithm. In this thesis, functions written in R collect web text data and carry out the necessary preprocessing. Feature words are then extracted in two ways, with word segmentation and without word segmentation. Mutual information is used as the criterion for screening feature words, and the TF-IDF weights of the feature words with the largest mutual information are calculated to construct a word-weight document matrix. Finally, the test texts are classified with KNN. According to the prediction results, KNN achieves a satisfactory classification effect under both feature word extraction methods.
Keywords/Search Tags: Text Classification, VSM, K-Nearest Neighbors, R Software
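
The pipeline summarized in the abstract (mutual-information feature selection, TF-IDF weighting, KNN classification) can be illustrated with a minimal sketch. This is not the author's code: the thesis reports an R implementation, while the sketch below is written in Python with only the standard library. The toy corpus, the function names, the pointwise variant of mutual information, the tf * log(N/df) weighting, and cosine similarity as the KNN proximity measure are all assumptions made for illustration, and the documents are assumed to be already tokenized.

```python
import math
from collections import Counter

# Toy, pre-tokenized training documents (hypothetical data, not from the thesis).
train_docs = [
    (["stock", "market", "shares", "rise"], "finance"),
    (["bank", "interest", "rate", "market"], "finance"),
    (["shares", "bank", "profit", "season"], "finance"),
    (["team", "match", "goal", "win"], "sports"),
    (["player", "goal", "season", "team"], "sports"),
    (["match", "coach", "win", "season"], "sports"),
]

def mutual_information(docs):
    """Score each term by its largest pointwise mutual information with any class.

    Assumed variant: MI(t, c) = log(P(t, c) / (P(t) * P(c))), estimated from
    document counts; pairs that never co-occur are skipped.
    """
    n = len(docs)
    class_count = Counter(label for _, label in docs)
    doc_freq = Counter()   # number of documents containing the term
    joint = Counter()      # number of documents of a class containing the term
    for tokens, label in docs:
        for term in set(tokens):
            doc_freq[term] += 1
            joint[(term, label)] += 1
    scores = {}
    for (term, label), a in joint.items():
        mi = math.log(a * n / (doc_freq[term] * class_count[label]))
        scores[term] = max(scores.get(term, -math.inf), mi)
    return scores

def select_features(docs, top_n):
    """Keep the top_n terms by MI score; ties are broken alphabetically."""
    scores = mutual_information(docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

def tfidf_vector(tokens, features, idf):
    """Map one document to a TF-IDF vector over the selected feature words."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[t] / total * idf[t] for t in features]

def build_model(docs, top_n):
    """Select features and build the word-weight document matrix (one row per document)."""
    features = select_features(docs, top_n)
    n = len(docs)
    idf = {t: math.log(n / sum(1 for tokens, _ in docs if t in tokens))
           for t in features}
    vectors = [tfidf_vector(tokens, features, idf) for tokens, _ in docs]
    labels = [label for _, label in docs]
    return features, idf, vectors, labels

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def knn_predict(vec, train_vecs, train_labels, k):
    """Majority vote among the k training documents most similar to vec."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: -cosine(vec, train_vecs[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

features, idf, vectors, labels = build_model(train_docs, top_n=10)
query = tfidf_vector(["bank", "profit", "market", "rise"], features, idf)
print(knn_predict(query, vectors, labels, k=3))
```

In this toy corpus, "season" appears in both classes, so its mutual-information score is lower than that of class-exclusive words and it is not selected. Pointwise mutual information tends to favor rare terms, so a real corpus would typically also need a minimum document-frequency cutoff before ranking.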