The Study On Chinese Text Classification Based Python

Posted on:2017-12-10

Degree:Master

Type:Thesis

Country:China

Candidate:F Yao

GTID:2347330503990894

Subject:Applied Statistics

Abstract/Summary:

With the advent of the big data era, the Internet holds a vast amount of data and information of many types, and most network resources appear as text or hypertext. Organizing and managing large-scale text effectively, and finding needed and valid information quickly and accurately, is a challenge for science and information technology. Text classification based on machine learning has therefore become a topic of considerable interest. This thesis studies Chinese text categorization in depth using the Sohu news corpus. It first reviews the state of text categorization research in China and abroad. It then examines the related technologies, including information retrieval models, text features, word segmentation and various classification algorithms. For feature weighting, it modifies the traditional TF-IDF algorithm by magnifying word frequency, which weakens the influence of text length. To handle the high-dimensional, sparse text vector matrix, it applies a hash algorithm, which improves the space and time efficiency of the classification project. On this basis, it describes the classification algorithms in more detail: Naive Bayes, K-nearest neighbor, Random Forest and support vector machine (SVM). Finally, the data set is split into 80% for training and 20% for testing, and a complete Chinese text classification system is implemented in Python. Cross-validation is then carried out, and the classification algorithms are evaluated by the average accuracy, the average recall rate and a third measure. SVM gives the best results, with the average accuracy rate, the average recall rate and the return rate all up to 92%. K-nearest neighbor gives the worst results: although its average accuracy rate is 75%, its average recall rate and the third measure are only 19% and 12%. The thesis also analyzes the causes of these results, explores some methods for improving the classification algorithms, and outlines future work.
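
Two of the steps named in the abstract, hashing a sparse term vector into a fixed number of dimensions and scoring a classifier by averaged per-class measures, can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's implementation: the function names `hashed_term_counts` and `macro_precision_recall`, the bucket count, and the choice of CRC32 as the hash are all assumptions, and "average accuracy" is assumed here to mean per-class precision averaged over categories. The input text is assumed to be already segmented into words.

```python
import zlib
from collections import Counter


def hashed_term_counts(tokens, n_buckets=1024):
    """Map segmented tokens to a sparse {bucket_index: count} dictionary.

    Hypothetical helper: the hash function and bucket count are assumptions.
    """
    counts = Counter()
    for token in tokens:
        # crc32 is stable across runs, unlike Python's salted built-in hash()
        bucket = zlib.crc32(token.encode("utf-8")) % n_buckets
        counts[bucket] += 1
    return dict(counts)


def macro_precision_recall(y_true, y_pred):
    """Average per-class precision and recall over the classes in y_true.

    Hypothetical helper: assumes the abstract's averages are macro-averages.
    """
    labels = sorted(set(y_true))
    precisions, recalls = [], []
    for label in labels:
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = sum(1 for t in y_true if t == label)
        # A class that is never predicted contributes a precision of 0
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)
```

For example, `hashed_term_counts(["体育", "新闻", "体育"], n_buckets=16)` returns a dictionary with at most two keys whose counts sum to 3. Distinct words can collide in the same bucket, which is the price paid for a fixed-size vector.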

Keywords/Search Tags:

Chinese Text Categorization, VSM, Feature Weight, Classification Algorithm, Python

Related items

1	Chinese Text Categorization Method And Implementation
2	The Method Of Selecting Local Feature Words And Its Application In Text Classification
3	Realization Of Text Classification And Recognition Based On NLP Method
4	Statistical Classification Analysis For High-dimensional Data
5	Research On Semantic Classification Model Of Teaching Evaluation Based On Feature Weighted Stacking Algorithm
6	A Text Classification Based On The Recurrent Neural Networks
7	Retrieval Text Classification Based On Recall And Sort
8	Research Of SVM Kernel Functions In Text Classification
9	Algorithm For Mining Frequent Itemsets And Its Optimization
10	Improved Naive Bayes Algorithm With Application To Text Classification