
Research And Application Of Chinese Text Mining

Posted on: 2015-12-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Qi
Full Text: PDF
GTID: 2308330464468611
Subject: Computer software and theory
Abstract/Summary:
Data mining is a mature research area, and it provides a new way to address the problem that overloaded information cannot be utilized effectively. Data mining techniques for structured data have been widely researched, but there is comparatively little research on mining techniques for the rapidly growing volume of text data on the network. Because text data contains abundant meaning and knowledge, it is important to investigate text mining techniques.

In addition, as the most basic application and a key technique of text mining, text classification can process and manage large amounts of text data and effectively reduce the problem of cluttered information. Since text classification can locate information efficiently and categorize text, it has broad application prospects.

This thesis focuses on Chinese text mining methods and implementation techniques. A new text classification method based on an extension of the vector space model (VSM) is proposed, and a classifier is implemented to verify the method. The main content of the thesis is as follows:

1. The principles and implementation of text mining technology are introduced. First, as the basis of text mining, the procedure of text preprocessing and some related algorithms are introduced. The procedure involves text representation, Chinese word segmentation, feature extraction, and the calculation of feature weights. Then, several common text classification methods and their technical principles are introduced, and the advantages and disadvantages of each method are described in detail.

2. A text classification method based on a feature sememe extension vector space model is proposed. The concept of the feature sememe is proposed in this thesis according to the "sememe" of HowNet. The approach to selecting features from the text is improved and the vector space model is restructured. First, the feature items of the text are selected using a modified TF-IDF method. Second, the feature sememes contained in the feature items are extracted. Third, the feature sememes are extended to obtain the extension feature items and to produce the sememe document of the class. Finally, the weight of each feature item is computed from the sememe document.

3. The original VSM and the synonymous VSM are introduced. The original VSM is obtained with the original feature selection method; the synonymous VSM is obtained by performing feature extraction with a table of synonyms. These two models and the feature sememe extension VSM are integrated to combine their respective advantages, and the integrated result forms the feature item space of the text. The VSM is then restructured and text classification is carried out.

4. The method based on the extension VSM is verified by experiment. Experiments are run with different feature extraction methods to obtain the precision and recall of the classification results, which are then compared with the results of the proposed extension VSM method. The experimental results show that the method improves the accuracy of feature item selection and increases the dimensionality of the effective feature vector, so that both the accuracy and the stability of text classification are improved.
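
The abstract names TF-IDF feature weighting, the vector space model, and precision and recall without giving formulas. As a rough illustration only, the sketch below uses standard TF-IDF (not the modified TF-IDF of the thesis), cosine similarity between document vectors, and per-class precision and recall. It assumes the documents are already segmented into words; the function names and the toy documents are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) from pre-segmented documents.

    Uses the textbook weighting tf * log(N / df); the thesis's modified
    TF-IDF is not specified in the abstract, so this is only an assumption.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def precision_recall(predicted, actual, label):
    """Precision and recall of one class label over paired predictions."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == label and a == label)
    fp = sum(1 for p, a in pairs if p == label and a != label)
    fn = sum(1 for p, a in pairs if p != label and a == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy corpus: two sports documents and one finance document.
docs = [
    ["足球", "比赛", "进球"],
    ["股票", "市场", "上涨"],
    ["足球", "球队", "比赛"],
]
vectors = tf_idf_vectors(docs)
same_topic = cosine(vectors[0], vectors[2])   # documents sharing terms
cross_topic = cosine(vectors[0], vectors[1])  # documents with no shared terms
```

In this sketch, documents that share no terms have zero similarity, which is the limitation that synonym-based or sememe-based feature extension is meant to reduce.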
Keywords/Search Tags:Text Mining, Text Preprocessing, Vector Space Model, Feature Extraction, Extend VSM