Research On Text Classification In Data Mining

Posted on: 2019-08-27  Degree: Master  Type: Thesis
Country: China  Candidate: M L Zhu  Full Text: PDF
GTID: 2428330566488827  Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, with the large-scale spread of the Internet, the amount of information has grown explosively, and the demand for text mining technology is increasing. Traditional manual classification can no longer handle such a large amount of information. Automatic text classification can organize and manage large volumes of text data and process massive information efficiently, so it has important research value. This thesis carries out the following research on text classification.

Firstly, the technologies related to text classification are studied, including text preprocessing, text representation models, feature weighting, feature selection, classification algorithms, and classifier performance evaluation. Text preprocessing, consisting of word segmentation and stop-word removal, is used to obtain cleaner text. Since Chinese text differs from English text, this thesis introduces several common word segmentation tools, which makes it convenient to compare the classification results obtained with different segmentation methods. For the corpus, this thesis uses the Chinese corpus compiled by Professor Li Ronglu of Fudan University.

Secondly, a comprehensive feature weighting scheme for Chinese text classification is designed on the basis of these techniques. After analyzing the advantages and disadvantages of the traditional feature weighting method TF-IDF, inter-class and intra-class document factors are added. The scheme reduces the weight that TF-IDF gives to low-frequency words while still taking into account features that are low in frequency but can distinguish between categories, so as to form more effective text features. Experiments show that the improved weighting scheme effectively improves classification performance.

Finally, a feature selection method based on information gain is given. The traditional information gain feature selection method is studied and analyzed, and two factors, inter-class document word frequency and intra-class word-frequency homogeneity, are introduced to form new features. The improved feature selection method is compared with common feature selection methods on different corpora. Comparative analysis is also performed across different classifiers, and the respective advantages of several feature selection methods are identified. The experiments verify the effectiveness of the improved feature selection method in text classification.
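
For orientation, the two traditional baselines named in the abstract, TF-IDF weighting and information gain feature selection, can be sketched as follows. This is a minimal illustration of the textbook formulas only, not the thesis's improved schemes: the inter-class and intra-class factors are not specified in the abstract and are not reproduced here. The function names, the toy documents, and the assumption that documents arrive already segmented into tokens are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Textbook TF-IDF: tf = count / doc length, idf = ln(N / document frequency).

    docs is a list of token lists (assumed already segmented and stop-word filtered).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Traditional information gain of a term: H(C) - H(C | term present/absent)."""
    classes = sorted(set(labels))
    present = [lab for tokens, lab in zip(docs, labels) if term in tokens]
    absent = [lab for tokens, lab in zip(docs, labels) if term not in tokens]
    class_entropy = entropy([labels.count(c) for c in classes])
    conditional = 0.0
    for subset in (present, absent):
        if subset:  # skip an empty split to avoid dividing by zero
            share = len(subset) / len(labels)
            conditional += share * entropy([subset.count(c) for c in classes])
    return class_entropy - conditional


# Toy, hypothetical data: four tiny pre-segmented documents in two classes.
docs = [
    ["stock", "market", "rise"],
    ["stock", "fund"],
    ["team", "match", "win"],
    ["team", "market"],
]
labels = ["finance", "finance", "sports", "sports"]

weights = tf_idf(docs)
ig_stock = information_gain(docs, labels, "stock")    # separates the classes perfectly
ig_market = information_gain(docs, labels, "market")  # appears equally in both classes
```

In this toy data, "stock" occurs only in finance documents and so receives the maximum information gain for two balanced classes, while "market" is split evenly across classes and receives none. A term such as "rise", which occurs in a single document, gets a higher TF-IDF weight than "stock" in the same document, which illustrates the sensitivity to low-frequency words that the abstract says the improved weighting scheme is meant to address.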
Keywords/Search Tags: Chinese text classification, feature weighting, feature selection, information gain