
Research On Chinese Text Classifier Based On Probability Method

Posted on: 2017-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Xie
Full Text: PDF
GTID: 2348330491950437
Subject: Computer Science and Technology
Abstract/Summary:
The growth of the Internet has rapidly increased the volume of document data, making it difficult to extract valuable information. Text classification is used to organize such data, and its two main components are feature selection and the classification algorithm. Feature selection is the foundational step: it reduces dimensionality and removes uninformative words. The classifier is equally important because it directly determines classification performance. This thesis analyzes the shortcomings of the CHI-square statistic and the Naive Bayes classifier and proposes a probability-based Chinese text classifier that improves classification in two respects.

First, the traditional CHI-square feature selection method does not take into account the number of categories in which a word appears, the frequency of words and documents, or the intra-class and inter-class distribution of words and documents in highly skewed datasets, so it fails to select effective words. This thesis proposes a new probability-based feature selection method. It measures the frequency of words and documents as probabilities, then uses these to compute a frequency factor for categories, concentration factors of words and documents across all categories, and a balance factor of words within each category. Finally, the method adjusts the CHI-square value using these factors. The adjusted CHI-square selects more effective words for each category. Experimental results show that the proposed method improves precision, recall, and macro-averaged F1 on skewed datasets.

Second, the Naive Bayes classifier cannot distinguish the contribution of individual words, and common improved methods do not compute feature weights comprehensively. This thesis proposes an improved probability-based Naive Bayes classifier that uses the adjusted CHI-square value obtained above to weight features when computing the posterior probability. Experimental results show that the proposed method improves classifier performance. Together, these improvements strengthen CHI-square feature selection on skewed datasets and raise the precision of the Naive Bayes classifier.
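To make the two ingredients concrete, the following is a minimal Python sketch of the baseline that the abstract starts from: the standard (unadjusted) CHI-square score for each word and category, and a Naive Bayes classifier whose per-word log-likelihoods are scaled by a CHI-square-derived weight. The abstract does not give the formulas for the frequency, concentration, and balance factors, so they are not reproduced here. The function names, the per-word weight (maximum CHI-square over categories, normalized to [0, 1]), the Laplace smoothing, and the assumption that documents are already segmented into word tokens are all illustrative choices, not the thesis's method.

```python
import math
from collections import Counter, defaultdict


def chi_square(docs, labels):
    """Standard chi-square score for every (term, category) pair.

    docs is a list of token lists (assumed already word-segmented);
    labels is the parallel list of category names.
    """
    n = len(docs)
    cat_docs = Counter(labels)              # documents per category
    term_docs = Counter()                   # documents containing each term
    term_cat_docs = defaultdict(Counter)    # same, split by category
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_docs[term] += 1
            term_cat_docs[term][label] += 1

    scores = {}
    for term, df in term_docs.items():
        for cat, n_cat in cat_docs.items():
            a = term_cat_docs[term][cat]    # in category, contains term
            b = df - a                      # other categories, contains term
            c = n_cat - a                   # in category, lacks term
            d = n - n_cat - b               # other categories, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            scores[(term, cat)] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def term_weights(scores):
    """Hypothetical weighting: max chi-square per term, scaled to [0, 1]."""
    best = defaultdict(float)
    for (term, _cat), score in scores.items():
        best[term] = max(best[term], score)
    top = max(best.values(), default=0.0) or 1.0
    return {term: score / top for term, score in best.items()}


def train(docs, labels):
    """Collect the counts needed by a multinomial Naive Bayes model."""
    cat_docs = Counter(labels)
    term_freq = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_freq[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return cat_docs, term_freq, vocab


def classify(tokens, model, weights):
    """Pick the category with the highest weighted log-posterior."""
    cat_docs, term_freq, vocab = model
    n = sum(cat_docs.values())
    scores = {}
    for cat, n_cat in cat_docs.items():
        total = sum(term_freq[cat].values())
        score = math.log(n_cat / n)         # log prior
        for t in tokens:
            if t in vocab:
                # Laplace-smoothed likelihood, scaled by the term's weight
                p = (term_freq[cat][t] + 1) / (total + len(vocab))
                score += weights.get(t, 0.0) * math.log(p)
        scores[cat] = score
    return max(scores, key=scores.get)
```

In this sketch a word that is spread evenly across categories receives a weight near zero and therefore contributes almost nothing to the posterior, while a word concentrated in one category keeps its full log-likelihood. An adjusted CHI-square of the kind the abstract describes would be substituted for `chi_square` without changing the rest of the pipeline.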
Keywords/Search Tags:CHI-square statistic, probability method, skewed datasets, Naive Bayes Classifier, text categorization