
The Text Classification Research And Its Application Based On LDA

Posted on: 2017-05-19    Degree: Master    Type: Thesis
Country: China    Candidate: J R Zhang    Full Text: PDF
GTID: 2308330485480421    Subject: Computer application technology
Abstract/Summary:
With the rapid development of science, technology, and the Internet, we have entered the information age. Every day, social media platforms and news sites produce large volumes of data, most of it text. How to extract the information we need from this mass of text has become a growing concern, and it has driven the emergence and rapid development of automatic text classification technology. In recent years, text classification has become one of the most active and difficult research topics in natural language processing, and many scholars have contributed to the field. In the late 1950s, H. P. Luhn made a pioneering contribution by first proposing the idea of counting word frequencies. In 1960, Maron published the first article on automatic text classification, which promoted the development of the field, and more scholars subsequently joined this line of research.

This thesis first reviews the research background and significance of the topic. It then analyzes the relevant theory of text classification, together with related techniques and methods, and points out their shortcomings. On this basis, the thesis makes the following contributions:

1. This thesis proposes VB-LDA (Latent Dirichlet Allocation with the Vector and Bigram), a weakly supervised text classification algorithm based on LDA. The algorithm first improves the LDA probabilistic generative model. The original LDA model ignores word order in the text: it is a pure bag-of-words model in which adjacent words are independent of each other. The improved document generation model preserves the word order of the document and, on that premise, incorporates bigrams into the generation process by introducing a state random variable x between every two adjacent words.
The variable x indicates whether the two adjacent words form a bigram. VB-LDA also introduces word vectors. In the original LDA model, the class label of a topic is determined from the high-frequency words of that topic, and it is usually domain experts who decide which class label a topic belongs to. This thesis takes a different approach: it introduces the quantitative tool word2vec, which converts a word into a word vector.

2. VB-LDA is applied to text classification. The main idea is as follows. First, the improved LDA is used to model the training data in order to obtain the high-frequency words of each topic and the representative words of every class label. Next, word2vec transforms each topic's high-frequency words and each class label's representative words into word vectors. Finally, a distance measure is used to determine the class label of the main topic of a document, and that label is assigned to the document.

Experimental results on public datasets (20 Newsgroups, WebKB, and SRAA) show that VB-LDA achieves classification performance very close to that of the SAM algorithm without needing manually labeled training data. VB-LDA also outperforms other text classification algorithms that are not based on LDA.
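The role of the indicator variable x in document generation can be illustrated with a small sketch. The abstract does not give the exact generative process, so everything below is an assumption made for illustration: the function name `generate_document`, the per-word switch probability `p_bigram`, and the rule that a word with x = 1 is drawn from a distribution conditioned on the previous word, while a word with x = 0 is drawn from its topic as in ordinary LDA. All distributions are toy values, not learned parameters.

```python
import random

def generate_document(n_words, topic_mix, topic_word, bigram_next, p_bigram, rng):
    """Sample one toy document and return (words, topics, xs).

    topic_mix:   topic probabilities for this document
    topic_word:  topic_word[k] maps word -> probability under topic k
    bigram_next: bigram_next[prev] maps word -> probability given prev
    p_bigram:    p_bigram[prev] = P(x = 1 | previous word)  (hypothetical)
    """
    words, topics, xs = [], [], []
    for _ in range(n_words):
        # Draw a topic for this position, as in ordinary LDA.
        z = rng.choices(range(len(topic_mix)), weights=topic_mix)[0]
        prev = words[-1] if words else None
        # x can be 1 only if a previous word exists and has bigram continuations.
        can_pair = prev is not None and prev in bigram_next
        x = 1 if can_pair and rng.random() < p_bigram.get(prev, 0.0) else 0
        # x = 1: the word depends on the previous word; x = 0: on the topic.
        dist = bigram_next[prev] if x == 1 else topic_word[z]
        w = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(w)
        topics.append(z)
        xs.append(x)
    return words, topics, xs

# Toy parameters, invented for this example only.
TOPIC_WORD = [
    {"neural": 0.5, "network": 0.3, "model": 0.2},
    {"stock": 0.5, "market": 0.3, "price": 0.2},
]
BIGRAM_NEXT = {"neural": {"network": 1.0}, "stock": {"market": 1.0}}
P_BIGRAM = {"neural": 0.9, "stock": 0.9}

words, topics, xs = generate_document(
    20, [0.5, 0.5], TOPIC_WORD, BIGRAM_NEXT, P_BIGRAM, random.Random(0)
)
print(list(zip(words, xs)))
```

Because word order is kept, each x is tied to a specific pair of neighbouring positions, which a pure bag-of-words model cannot express.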
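The classification step described above (word vectors for topic words and class-label words, then a distance measure) can be sketched as follows. This is a minimal illustration and not the thesis's implementation: the abstract does not say which distance measure is used or how several word vectors are combined, so cosine distance and simple averaging are assumptions here. The two-dimensional vectors in `TOY_VECTORS` are invented stand-ins for word2vec output, and all function names are hypothetical.

```python
import math

def mean_vector(words, vectors):
    """Average the vectors of the words that have an embedding."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        raise ValueError("none of the words has a vector")
    dim = len(found[0])
    return [sum(v[d] for v in found) / len(found) for d in range(dim)]

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def label_topic(topic_words, class_words, vectors):
    """Return the class label whose representative words lie closest to the topic."""
    topic_vec = mean_vector(topic_words, vectors)
    return min(
        class_words,
        key=lambda label: cosine_distance(
            topic_vec, mean_vector(class_words[label], vectors)
        ),
    )

def classify_document(doc_topic_probs, topic_top_words, class_words, vectors):
    """Label a document with the class of its most probable (main) topic."""
    main_topic = max(range(len(doc_topic_probs)), key=doc_topic_probs.__getitem__)
    return label_topic(topic_top_words[main_topic], class_words, vectors)

# Toy stand-ins for word2vec vectors (assumption: real vectors would come from a trained model).
TOY_VECTORS = {
    "goal": [0.9, 0.1], "team": [0.8, 0.2], "match": [0.85, 0.15],
    "football": [0.9, 0.2],
    "cpu": [0.1, 0.9], "software": [0.2, 0.8], "disk": [0.15, 0.85],
    "computer": [0.1, 0.95],
}
TOPIC_TOP_WORDS = [["goal", "team", "match"], ["cpu", "software", "disk"]]
CLASS_WORDS = {"sports": ["football"], "computers": ["computer"]}

print(classify_document([0.2, 0.8], TOPIC_TOP_WORDS, CLASS_WORDS, TOY_VECTORS))  # computers
print(classify_document([0.7, 0.3], TOPIC_TOP_WORDS, CLASS_WORDS, TOY_VECTORS))  # sports
```

No labeled documents appear anywhere in this sketch: the only supervision is the short list of representative words per class, which matches the weakly supervised setting the abstract describes.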
Keywords/Search Tags:text classification, LDA, topic, word order, bigram grammar, word vector