
Tibetan Text Classification Technology Research Based on Naive Bayes

Posted on: 2014-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2268330425470644
Subject: Computer application technology
Abstract/Summary:
Tibetan is an important minority language, and its information technology is advancing rapidly. Tibetan people encounter a large amount of Tibetan-language information in their work and study, but identifying the useful information in that volume is not easy. Finding the information we need in massive amounts of text has therefore become an important goal in the development of computing. Tibetan text classification is an important part of Tibetan information processing, with wide application in text filtering, document retrieval, and automatic summarization.

This thesis studies Tibetan text classification based on the Bayesian algorithm. Its main topics are the construction of a Tibetan text classification corpus, the unification of Tibetan text encodings, Tibetan text representation, Tibetan feature selection algorithms, and a Tibetan naive Bayes classifier.

First, the thesis explores the construction of a Tibetan text classification corpus. Drawing on established Chinese classification corpora and on existing Tibetan corpus construction work, the Tibetan text classification corpus is divided into seven categories: economy, environment, health care, culture, education, information technology, political, religious folk.

Second, for various reasons Tibetan text exists in many encodings; the common ones are the TongYuan code, the BanZhiDa code, and Unicode. To simplify Tibetan language processing and lay a solid foundation for the follow-up work, all Tibetan text is converted to Unicode.

Fourth, for feature selection, the thesis compares the effect of the chi-square test and mutual information feature selection algorithms on Tibetan text classification, and compares the classification results of the different algorithms.

Fifth, taking synonyms into account further improves the classification results.

On the basis of this work, the preprocessing and the Tibetan encoding detection and transcoding programs were written in C#, and the Tibetan text classifier based on the Bayesian algorithm was implemented in C++.
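The feature selection and classification steps summarized above can be illustrated with a minimal Python sketch. It is not the thesis's C++ implementation: the function names, the toy pre-segmented documents, the English placeholder tokens, the top-k selection rule, and the Laplace (add-one) smoothing are all assumptions made for illustration. Only the chi-square test and the naive Bayes classifier themselves are named in the abstract.

```python
import math
from collections import Counter, defaultdict


def chi_square_scores(docs, labels):
    """Chi-square score for every (term, class) pair, from document presence counts."""
    n = len(docs)
    class_count = Counter(labels)
    term_count = Counter()   # number of documents containing the term
    joint = Counter()        # number of documents of a class containing the term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_count[term] += 1
            joint[(term, label)] += 1
    scores = {}
    for term in term_count:
        for label in class_count:
            a = joint[(term, label)]        # term present, in class
            b = term_count[term] - a        # term present, other classes
            c = class_count[label] - a      # term absent, in class
            d = n - a - b - c               # term absent, other classes
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            scores[(term, label)] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_features(docs, labels, k):
    """Keep the k terms with the highest per-class chi-square score (assumed rule)."""
    best = defaultdict(float)
    for (term, _), score in chi_square_scores(docs, labels).items():
        best[term] = max(best[term], score)
    ranked = sorted(best, key=lambda t: (-best[t], t))  # ties broken alphabetically
    return set(ranked[:k])


def train_naive_bayes(docs, labels, vocab):
    """Multinomial naive Bayes with add-one smoothing, restricted to vocab."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for label, count in class_docs.items():
        total = sum(term_counts[label].values())
        log_prior = math.log(count / len(docs))
        log_likelihood = {
            t: math.log((term_counts[label][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def classify(model, tokens):
    """Return the class with the highest log posterior; unknown terms are ignored."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood[t] for t in tokens if t in log_likelihood)
    return max(sorted(model), key=score)


# Hypothetical, already segmented toy documents (placeholder tokens, not Tibetan).
docs = [
    ["market", "price", "trade"],
    ["price", "bank", "the"],
    ["school", "teacher", "the"],
    ["student", "school", "exam"],
]
labels = ["economy", "economy", "education", "education"]

vocab = select_features(docs, labels, 6)
model = train_naive_bayes(docs, labels, vocab)
print(classify(model, ["price", "market"]))
```

In this toy data the token "the" appears equally often in both classes, so its chi-square score is zero and it is dropped by the selection step. Real Tibetan input would first need encoding unification and word segmentation, which this sketch leaves out.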
Keywords/Search Tags: Tibetan text classification, Naive Bayes algorithm, feature selection, Corpus