Research On Stop Words And Feature Selection For Text Classification

Posted on: 2015-02-06  Degree: Master  Type: Thesis
Country: China  Candidate: Z T Ma  Full Text: PDF
GTID: 2308330464468660  Subject: Computer technology
Abstract/Summary:
In the 21st century we have entered the information age, and more and more information appears in the form of electronic documents. Automatic text classification helps people find what they want more accurately, so the study of text categorization is meaningful and valuable. In this paper we discuss two key techniques of automatic text classification: text preprocessing and feature selection.

In the text preprocessing stage, the selection of stop words influences both classification accuracy and efficiency. Traditionally, stop words are chosen by people based on their experience and collected into a set called a stop-word list. Stop words obtained this way do not take the characteristics of the corpus into account. In this paper, we use the coefficient of variation to describe how dispersed a word is across the categories that contain it. The new method counts the document frequency of a word in each class, calculates the coefficient of variation, and sets a minimum number of categories that must contain the word, in order to decide whether the word is a stop word. Experimental results show that the new method yields different sets of stop words for different corpora, which indicates strong adaptability.

In the feature selection stage, a good feature selection method picks out more meaningful words, which improves classification accuracy. The traditional chi-square test has the following defects. First, it only considers whether a word appears in a document and ignores word frequency, which may lead it to choose low-frequency words. Second, if the document frequency of a word is small in the specified category and large in the other categories, the chi-square test still gives the word a high value, even though the word is negatively correlated with the category.

To address these defects, we introduce the standard score and the feature distribution inside a category, respectively, and propose an improved chi-square test. The new method uses a standard score to describe the distance between the document frequency of a word in a specific category and the average document frequency of the word over all categories, and uses the feature distribution inside a category to describe how the word is distributed within that category. Finally, we verify the effectiveness of the improved method through experiments.
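The stop-word selection described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give exact formulas or thresholds, so everything here is an assumption made for illustration: document frequencies are taken to be per-category fractions, a low coefficient of variation is read as "evenly spread, hence a stop-word candidate", and the names `select_stop_words`, `min_categories`, and `max_cv` are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def select_stop_words(doc_freq, min_categories=3, max_cv=0.3):
    """Pick words that occur in many categories with similar frequency.

    doc_freq maps word -> {category: fraction of that category's documents
    containing the word}. Both thresholds are illustrative assumptions,
    not values taken from the thesis.
    """
    stop_words = set()
    for word, per_category in doc_freq.items():
        # Only categories that actually contain the word are considered.
        present = [df for df in per_category.values() if df > 0]
        if len(present) < min_categories:
            continue
        if coefficient_of_variation(present) <= max_cv:
            stop_words.add(word)
    return stop_words


# Toy corpus statistics (invented for the example).
example_doc_freq = {
    "the": {"sports": 0.95, "finance": 0.97, "tech": 0.96},
    "goal": {"sports": 0.60, "finance": 0.05, "tech": 0.04},
    "stock": {"sports": 0.0, "finance": 0.70, "tech": 0.0},
}
example_stop_words = select_stop_words(example_doc_freq)
```

On this toy data only "the" is selected: "goal" appears in every category but with very uneven frequency, and "stock" appears in too few categories. Because the statistics come from the corpus itself, a different corpus would produce a different list, which is the adaptability the abstract refers to.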
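The second defect of the chi-square test, and the standard score the abstract proposes against it, can be sketched in Python as follows. The chi-square statistic uses the usual 2x2 term/category contingency table. How the thesis combines the standard score and the in-category distribution with chi-square is not stated in the abstract, so this sketch only computes the pieces separately; the function names and the toy counts are assumptions.

```python
from statistics import mean, pstdev


def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category without the term
    d: documents outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def standard_score(df_by_category, category):
    """Z-score of the term's document frequency in one category,
    relative to its frequencies over all categories."""
    values = list(df_by_category.values())
    sd = pstdev(values)
    return (df_by_category[category] - mean(values)) / sd if sd else 0.0


# A term that is rare in the category but common elsewhere...
negative = chi_square(a=2, b=90, c=98, d=10)
# ...scores exactly as high as a term with the mirrored pattern.
positive = chi_square(a=90, b=2, c=10, d=98)

# The standard score keeps the direction that chi-square discards.
df = {"sports": 2, "finance": 45, "tech": 45}
z_sports = standard_score(df, "sports")
z_finance = standard_score(df, "finance")
```

Because the statistic squares `a * d - b * c`, the two calls return the same value and the direction of the association is lost. The standard score is negative for "sports" and positive for "finance", so it can distinguish the two cases.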
Keywords/Search Tags: text classification, stop words, chi-square test, standard score