A Study of Chinese Text Classification Based on Python

Posted on: 2017-12-10    Degree: Master    Type: Thesis
Country: China    Candidate: F Yao    Full Text: PDF
GTID: 2347330503990894    Subject: Applied Statistics
Abstract/Summary:
With the advent of the big data era, the Internet holds a vast amount of data and information of many types, and most network resources appear as text or hypertext. Organizing and managing large-scale text effectively, and finding the needed, valid information quickly and accurately, is a challenge in science and information technology. Text classification based on machine learning has therefore become an active research topic. This thesis uses the Sohu news corpus to study Chinese text categorization in depth.

The thesis first reviews the state of research on text categorization in China and abroad. It then examines the related techniques, including information retrieval models, text features, word segmentation, and various classification algorithms. For feature weighting, it modifies the traditional TF-IDF algorithm by magnifying the word frequency in order to weaken the influence of text length. To handle the high-dimensional, sparse text vector matrix, it applies a hashing algorithm, which improves the space and time efficiency of the classification. On this basis, the thesis describes the classification algorithms in more detail: Naive Bayes, K-nearest neighbor, Random Forest, and support vector machine (SVM).

Finally, the data set is split into 80% for training and 20% for testing, and a complete Chinese text classification system is implemented in Python. Cross-validation is then used to evaluate each classification algorithm by its average accuracy, its average recall, and a third measure (its name is garbled in the source; the context suggests the F1 score). SVM gives the best results, with all three averages reaching 92%. K-nearest neighbor gives the worst: although its average accuracy is 75%, its average recall and third measure are only 19% and 12%, respectively. The thesis also analyzes the causes of these results, explores ways to improve the classification algorithms, and outlines future work.
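The hashing step mentioned in the abstract can be illustrated with a minimal sketch of feature hashing. The abstract does not say which hash function or vector size the thesis uses, so the function name `hash_features`, the MD5-based index, and the bucket count below are assumptions for illustration only, and the sample tokens are hypothetical, already-segmented words.

```python
import hashlib
from collections import defaultdict


def hash_features(tokens, n_buckets=2 ** 10):
    """Map a list of tokens to a sparse {bucket_index: count} vector.

    Assumption: MD5 is used only as a deterministic hash; Python's
    built-in hash() is randomized per process for strings.
    """
    vector = defaultdict(int)
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Interpret the first 4 bytes as an unsigned integer, then
        # fold it into the fixed number of buckets.
        index = int.from_bytes(digest[:4], "big") % n_buckets
        vector[index] += 1
    return dict(vector)


# Hypothetical, pre-segmented document (word segmentation not shown).
doc = ["体育", "比赛", "体育", "新闻"]
vec = hash_features(doc, n_buckets=16)
print(vec)
```

The vector length is fixed by `n_buckets` regardless of vocabulary size, so no term dictionary has to be stored. The cost is that distinct words can collide in the same bucket, which a larger bucket count makes less likely.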
Keywords/Search Tags: Chinese Text Categorization, VSM, Feature Weight, Classification Algorithm, Python