Study Of Chinese Text Classification

Posted on:2011-07-18

Degree:Master

Type:Thesis

Country:China

Candidate:W L Lv

GTID:2178330305460423

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of information technology and the spread of the Internet, the volume of information resources has grown dramatically. Managing and using these resources effectively, so that users can retrieve information quickly, comprehensively, and accurately, has become a pressing problem. As an important information processing technology, text classification supports the fast, comprehensive, and accurate organization and use of information resources and gives users efficient retrieval. It has therefore become an increasingly important research area.

This thesis examines the key technologies of Chinese text classification: Chinese word segmentation, text representation, feature selection and weight calculation, classification algorithms, and classifier performance evaluation. It then studies in depth the two core issues of the classification process: feature selection and weighting, and the classification algorithm. Feature selection and weight calculation select the terms that best represent the text content and weight them, which removes negative effects and improves the speed and accuracy of classification; the quality of the classification algorithm directly affects the classification results.

The thesis makes two contributions. First, in weight calculation, it analyzes two defects of the classic TFIDF feature weighting method: dataset skew, and the failure to account for differences in term distribution between classes and within a class. To address the second defect, the MI (mutual information) feature selection algorithm is introduced into weight calculation to capture distribution differences between classes, and a correction factor is added to the formula to account for differences within a class. Second, in classification algorithm design, the SE-Bagging KNN algorithm is designed on the basis of previous research results.
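As a rough illustration of the weighting idea, the sketch below scales a TF-IDF weight by a mutual-information factor. The abstract does not give the thesis's formula, so the combination `tf * idf * (1 + MI)`, the function name `tfidf_mi_weights`, and the use of the maximum pointwise MI over classes are all assumptions; the within-class correction factor is omitted because its form is not stated.

```python
import math
from collections import Counter


def tfidf_mi_weights(docs, labels):
    """Return one {term: weight} dict per document.

    docs   : list of token lists (already word-segmented)
    labels : class label of each document
    Weight = tf * idf * (1 + MI), an assumed combination, not the thesis's formula.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)   # number of documents per class
    df = Counter()                  # document frequency of each term
    df_by_class = {}                # class -> document frequency of each term
    for tokens, label in zip(docs, labels):
        seen = set(tokens)
        df.update(seen)
        df_by_class.setdefault(label, Counter()).update(seen)

    def mi_max(term):
        # Largest pointwise mutual information log(P(t|c) / P(t)) over classes,
        # clipped at 0 so the factor never lowers the plain TF-IDF weight.
        p_t = df[term] / n_docs
        best = 0.0
        for label, size in class_sizes.items():
            p_t_given_c = df_by_class[label][term] / size
            if p_t_given_c > 0:
                best = max(best, math.log(p_t_given_c / p_t))
        return best

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        doc_weights = {}
        for term, count in tf.items():
            idf = math.log(n_docs / df[term]) + 1.0   # +1 keeps common terms above zero
            doc_weights[term] = (count / len(tokens)) * idf * (1.0 + mi_max(term))
        weights.append(doc_weights)
    return weights
```

In this sketch a term that occurs equally often in every class gets an MI factor of 1 and keeps its plain TF-IDF weight, while a term concentrated in one class is boosted.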
The SE-Bagging KNN algorithm introduces selective ensemble learning into Chinese text classification. Building on the attribute sensitivity of the KNN algorithm, it uses sampling with replacement to generate new training sets, each of which trains a KNN weak classifier, and then uses variable similarity clustering to select diverse, better-performing classifiers to classify new texts. Experiments were carried out on the .NET platform using the WordSeg component and the Weka data mining package. The results show that, compared with traditional classification algorithms, the SE-Bagging KNN text classification algorithm clearly improves classification quality.
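The bagging-plus-selection pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function names are hypothetical, Euclidean distance stands in for whatever text similarity the thesis uses, and the selection step simply keeps the base classifiers with the best validation accuracy, a simplified stand-in for the variable similarity clustering named in the abstract.

```python
import math
import random
from collections import Counter


def knn_predict(train, x, k=3):
    """Majority label among the k training points nearest to x (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, data, k):
    """Fraction of (vector, label) pairs in data that a KNN on train gets right."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


def fit_selective_bagging_knn(train, valid, n_models=10, keep=5, k=3, seed=0):
    """Train n_models bootstrap KNNs and keep the `keep` best on `valid`.

    A KNN "model" is just its training set, so the ensemble is a list of bags.
    """
    rng = random.Random(seed)
    # Sampling with replacement: each bag has the same size as the training set.
    bags = [[rng.choice(train) for _ in train] for _ in range(n_models)]
    # Assumed selection rule: rank by validation accuracy (not the thesis's clustering).
    ranked = sorted(bags, key=lambda bag: accuracy(bag, valid, k), reverse=True)
    return ranked[:keep]


def ensemble_predict(ensemble, x, k=3):
    """Majority vote over the selected base KNN classifiers."""
    votes = Counter(knn_predict(bag, x, k) for bag in ensemble)
    return votes.most_common(1)[0][0]
```

In practice the vectors would be the weighted term vectors of segmented documents; ranking by accuracy alone ignores classifier diversity, which is the part the thesis's clustering step is meant to address.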

Keywords/Search Tags:

text classification, feature selection, weight calculation, selective ensemble, variable similarity clustering

Related items

1	Research On Feature Selection Method Based On Clustering Ensemble
2	Text Sentiment Analysis Based On Text Classification
3	The Design And Implementation Of Text Classification System Based On SVM-KNN
4	Research On Text Classification Based On Rough Set
5	Research On Selective Clustering Ensemble Based On Cluster Validity Index
6	Research On Selective Clustering Ensemble Algorithm Based On Fractal Dimension
7	Improved CHI Method On Text Feature Selection
8	Chinese Text Clustering Based On Text Similarity
9	Research On Text Classification In The Application Of Customer Complaint Prediction Of Operator
10	Feature Selection For Unbalanced Data And Emotional Dictionary Building