
The Research Of Tibetan Text Classification Algorithms For The Analysis Of Network Public Opinion

Posted on: 2015-05-18 | Degree: Master | Type: Thesis
Country: China | Candidate: A L Li | Full Text: PDF
GTID: 2298330467974443 | Subject: Computer application technology
Abstract/Summary:
Tibetan text categorization aims to assign texts to the correct category and is a basic research problem in natural language processing. Tibetan text classification algorithms are an important part of research on Tibetan web public opinion analysis. Based on the characteristics of Tibetan web text, this thesis builds a Tibetan web text corpus and, through extensive data analysis and experiments, studies Tibetan text representation, text feature extraction methods, Tibetan text classification algorithms, and related techniques. The main work of the thesis is as follows:

(1) Summarize the features of Tibetan online texts and find a suitable web crawler for collecting Tibetan information. Because there are only a few Tibetan websites, blogs, and forums on the internet, Tibetan online texts are relatively concentrated. This thesis uses a multi-threaded crawler strategy. A category system was established according to the categories of Tibetan web data, and nearly 3,000 manually labeled texts were assembled into training and test corpora.

(2) Find preprocessing techniques suited to the characteristics of Tibetan text, drawing on Chinese text preprocessing methods. Given the characteristics and grammatical structure of Tibetan, the vector space model is used for text representation, a Tibetan stoplist is built to remove stop words, the bag-of-words model is used for feature selection, and the tf*idf algorithm is used to calculate the weights of feature items.

(3) After analyzing and summarizing Tibetan text classification techniques, propose an ensemble learning classification algorithm that integrates the naive Bayes algorithm and the support vector machine (SVM) algorithm. The thesis compares the merits of a naive Bayes Tibetan text classifier and an SVM Tibetan text classifier and fuses their advantages through ensemble learning. The method can classify Tibetan text effectively. In experiments on a small-scale corpus, taking the culture and education category as an example, precision, recall, and F1 were 69.3%, 72.3%, and 70.9%, improvements of 4.7 and 5.2 percentage points over the individual naive Bayes and SVM classifiers, respectively.
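The preprocessing in step (2) can be illustrated with a minimal sketch. It assumes the texts are already segmented into tokens and uses English placeholder tokens and a hypothetical one-word stoplist. The function name `tfidf_vectors` and the idf form log(N/df) are assumptions for illustration, since the abstract does not say which tf*idf variant the thesis uses.

```python
import math
from collections import Counter


def tfidf_vectors(docs, stoplist):
    """Build bag-of-words tf*idf vectors from pre-tokenized documents.

    docs: list of token lists (word segmentation is assumed to be done already).
    stoplist: set of tokens to discard before weighting.
    Returns (vocabulary, vectors), where each vector maps term -> weight.
    """
    # Remove stop words from every document.
    filtered = [[tok for tok in doc if tok not in stoplist] for doc in docs]

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in filtered:
        doc_freq.update(set(doc))

    n_docs = len(filtered)
    vectors = []
    for doc in filtered:
        counts = Counter(doc)
        total = sum(counts.values())
        vector = {}
        for term, count in counts.items():
            tf = count / total                       # relative term frequency
            idf = math.log(n_docs / doc_freq[term])  # assumed idf variant
            vector[term] = tf * idf
        vectors.append(vector)

    return sorted(doc_freq), vectors


# Placeholder tokens standing in for segmented Tibetan words (hypothetical data).
docs = [
    ["school", "teacher", "the", "school"],
    ["market", "price", "the"],
    ["school", "price"],
]
stoplist = {"the"}
vocabulary, vectors = tfidf_vectors(docs, stoplist)
```

Each resulting dictionary is a sparse vector in the vector space model. Terms that occur in every document receive weight 0 under this idf variant, so they contribute nothing to classification.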
Keywords/Search Tags:web public opinion, Tibetan text categorization, svm, naive bayesian, ensemble learning