Research On Text Classification Method Based On Part Of Speech Tagging LDA Model

Posted on: 2016-10-25
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhang
GTID: 2308330464473827
Subject: Computer application technology
Abstract/Summary:
Text classification is a major branch of text mining, and improving its speed and accuracy is a constant goal of researchers. The main work of this thesis is as follows:

1. This thesis analyzes the current state of research on text classification techniques. Text classification methods based on the LDA model use the text-topic distribution as the text features, but they cannot make efficient use of other information such as part of speech. To improve the speed and accuracy of text classification, this thesis adds part-of-speech information to the LDA model and puts forward a text classification method based on the PST_LDA (part-of-speech tagging LDA) model. First, the words in a text are divided into a noun set, a verb set and an other set according to their part of speech. Second, an LDA model is built for each set. Finally, the weight of each model in the combined model is determined according to how much words of each part of speech contribute to the text similarity calculation; the PST_LDA model is then used to extract text features and calculate text similarity, and the KNN method is used to classify the texts. Introducing part-of-speech information improves classification accuracy.

2. This thesis proposes a parallel processing method for text classification based on the PST_LDA model. To reduce the number of fetches of small text files in a parallel storage environment, the "Sequence File" format is used to merge multiple small files into one large file of <file name, file content> records, which improves the efficiency of each file access. In the modeling stage, the LDA models for the different word sets are built in parallel, which reduces modeling time and increases classification speed. The PST_LDA model is used to extract text features, and the "data parallel" idea is used to parallelize the KNN method that performs the classification.

3. In a single-machine environment, the TanCorp-12 data set is used as the experimental data. Features are extracted with the LDA method and the PST_LDA method respectively, and the texts are then classified with the KNN method. The experimental results show that the PST_LDA method improves both classification accuracy and classification speed: the macro_F value increased by 2.3% and the modeling runtime decreased by 27.5%.

4. In a Hadoop environment, a parallel experiment is carried out on the text classification method based on the PST_LDA model. The experimental results show that the modeling time in the Hadoop environment is only 44.2% of the modeling time in the single-machine environment, and the text classification time is only 54.1% of the single-machine time.
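
The similarity and classification step of the PST_LDA method described in the abstract might look roughly like the following minimal Python sketch. It assumes the three LDA models (noun, verb, other) have already been trained and that each document is represented by one topic distribution per word set. The weight values, the use of cosine similarity, the function names and the toy data are all illustrative assumptions; the abstract does not state the weights or the similarity measure the thesis actually used.

```python
import math
from collections import Counter

# Hypothetical weights for the noun, verb and other models. The abstract
# does not report the values chosen in the thesis.
POS_WEIGHTS = {"noun": 0.6, "verb": 0.3, "other": 0.1}


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(doc_a, doc_b, weights=POS_WEIGHTS):
    """Weighted sum of the per-part-of-speech topic similarities."""
    return sum(w * cosine(doc_a[pos], doc_b[pos]) for pos, w in weights.items())


def knn_classify(query, training, k=3, weights=POS_WEIGHTS):
    """Label `query` by majority vote among its k most similar training docs."""
    ranked = sorted(
        training,
        key=lambda item: combined_similarity(query, item[0], weights),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy document-topic distributions, invented for illustration only.
training = [
    ({"noun": [0.9, 0.1], "verb": [0.8, 0.2], "other": [0.5, 0.5]}, "sports"),
    ({"noun": [0.8, 0.2], "verb": [0.7, 0.3], "other": [0.5, 0.5]}, "sports"),
    ({"noun": [0.1, 0.9], "verb": [0.2, 0.8], "other": [0.5, 0.5]}, "finance"),
    ({"noun": [0.2, 0.8], "verb": [0.3, 0.7], "other": [0.5, 0.5]}, "finance"),
]
query = {"noun": [0.85, 0.15], "verb": [0.6, 0.4], "other": [0.5, 0.5]}
predicted = knn_classify(query, training, k=3)
```

Because the three similarities are computed independently and only combined at the end, the per-set models can be trained separately, which is what makes the parallel modeling described in the abstract possible.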
Keywords/Search Tags: text classification, PST_LDA, part-of-speech tagging, parallel processing