Text Classification And R Language Implementation Based On Vector Space Model

Posted on: 2019-07-02

Degree: Master

Type: Thesis

Country: China

Candidate: B W Jiang

Full Text: PDF

GTID: 2438330548455968

Subject: Applied statistics

Abstract/Summary:

With the rapid development of information technology and the growing popularity of the Internet, text information has accumulated rapidly. Text classification can effectively address the resulting information disorder, which is of great practical significance for the efficient management and use of information. This thesis introduces the whole process of text classification and several implementation methods, including text data acquisition, feature extraction, feature selection, construction of the term-weight document matrix, and classification of the text set. The vector space model (VSM) transforms unstructured text data into structured data that a computer can process, and it is currently the standard approach to text processing. K-nearest neighbors (KNN) is a nonparametric classification method: it does not assume that the data follow a particular distribution, and it tolerates a certain amount of noise. It offers high accuracy, a clear concept, and easy implementation for classifying data whose distribution is unknown or non-normal, and it is a widely used classification algorithm. In this thesis, functions are written in R to collect web text data and carry out the necessary preprocessing. Feature words are then extracted in two ways, with word segmentation and without word segmentation. Mutual information is used as the criterion for screening feature words, and the TF-IDF weights of the feature words with the largest mutual information are calculated to construct a term-weight document matrix. Finally, the test texts are classified with KNN. According to the prediction results, the k-nearest neighbor method achieves good classification results under both feature word extraction methods.
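
The last two steps of the pipeline described above (TF-IDF weighting in a vector space model, then KNN classification) can be sketched as follows. This is only a minimal illustration and not the thesis's code: the thesis implements its method in R, whereas this sketch uses plain Python. The function names (`tfidf_vectors`, `weight`, `cosine`, `knn_predict`), the smoothed IDF formula, the use of cosine similarity as the neighbor measure, and the toy documents are all assumptions made for the example. Data collection, word segmentation, and mutual-information feature selection are omitted.

```python
import math
from collections import Counter


def weight(tokens, idf):
    """Turn one tokenized document into a sparse TF-IDF vector (term -> weight)."""
    tf = Counter(tokens)
    total = len(tokens) or 1
    # Terms unseen in the training set have no IDF and are ignored.
    return {t: (c / total) * idf[t] for t, c in tf.items() if t in idf}


def tfidf_vectors(docs):
    """Build TF-IDF vectors for a list of tokenized documents.

    Returns (vectors, idf). The IDF here is log(N / df) + 1, one common
    variant chosen for this sketch; the thesis may use a different form.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [weight(doc, idf) for doc in docs], idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train_vecs, train_labels, query_vec, k=3):
    """Label a query by majority vote among its k most similar training documents."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(pair[0], query_vec),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, already-tokenized training set (hypothetical data).
train_docs = [
    ["stock", "market", "shares", "rise"],
    ["bank", "interest", "rate", "market"],
    ["team", "wins", "match", "goal"],
    ["player", "scores", "goal", "match"],
]
train_labels = ["finance", "finance", "sports", "sports"]

train_vecs, idf = tfidf_vectors(train_docs)
query_vec = weight(["goal", "match", "team"], idf)
predicted = knn_predict(train_vecs, train_labels, query_vec, k=3)
print(predicted)
```

Storing each document as a sparse dictionary rather than a dense matrix row keeps the sketch short. A term-weight document matrix of the kind the abstract describes holds the same weights, with one column per selected feature word.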

Keywords/Search Tags:

Text Classification, VSM, K-Nearest Neighbors, R Software

Related items

1	Research On Several Pattern Classification Methods Based On K-nearest Neighbor Criterion
2	Improvement Of KNN And Its Application To Text Classification
3	Research Of Chinese Text Classification Based On KNN
4	Text Classifications Using Transductive Confidence Machine For K Nearest Neighbors
5	Continuous Pass-by Nearest Neighbors Query In Road Network
6	Research On Chameleon Clustering Algorithm Based On Nearest Neighbor
7	The Research On Adaptive Scales Spectral Clustering Based On Nearest Neighbors Path
8	Nearest Neighborhood-Based Rare Category Mining
9	Research On Multi-label Text Classification Based On Semi-Supervised Learning
10	Research On Imbalanced Data Classification Based On The Distribution Of Near Neighbors