Research On Stop Words And Feature Selection For Text Classification

Posted on:2015-02-06

Degree:Master

Type:Thesis

Country:China

Candidate:Z T Ma

Full Text:PDF

GTID:2308330464468660

Subject:Computer technology

Abstract/Summary:

In the 21st century we have entered the information age, and more and more information appears in the form of electronic documents. Automatic text classification helps people find what they want more accurately, so the study of text categorization is meaningful and valuable. This paper discusses two key techniques of automatic text classification: text preprocessing and feature selection.

In the text preprocessing stage, the selection of stop words influences both classification accuracy and efficiency. Traditionally, stop words are chosen by people based on experience and collected into a set called a stop-word list. Stop words obtained this way do not take the characteristics of the corpus into account. In this paper, we use the coefficient of variation to describe the degree of dispersion of a word across the categories that contain it. The new method counts the document frequency of a word in each class, calculates the coefficient of variation, and sets a minimum number of categories that must contain the word in order to determine whether the word is a stop word. Experimental results show that the new method produces different sets of stop words for different corpora, which indicates strong adaptability.

In the feature selection stage, a good feature selection method picks out more meaningful words, which can improve classification accuracy. The traditional chi-square test has the following defects. First, it considers only whether a word appears in a document and does not take word frequency into account, which may lead it to select low-frequency words. Second, if the document frequency of a word is small in the specified category and large in the other categories, the chi-square test still gives the word a high value, even though the word is negatively correlated with that category.

To address these two defects, the standard score and the feature distribution inside a category are introduced respectively, and an improved chi-square test method is proposed. The new method uses a standard score to describe the distance between the document frequency of a word in a specific category and the average document frequency of the word over all categories, and uses the feature distribution inside a category to describe how the word is distributed within that category. Finally, we verify the effectiveness of the improved method through experiments.
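The stop-word selection idea described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the `max_cv` threshold, the use of the population standard deviation, and the toy document frequencies are all assumptions made for illustration. The abstract only states that the coefficient of variation is computed over the categories containing the word and that a minimum category count is applied.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def select_stop_words(doc_freq, min_categories, max_cv):
    """Pick words that occur in many categories with an even document frequency.

    doc_freq: word -> {category: document frequency in that category}
    min_categories: minimum number of categories that must contain the word
    max_cv: hypothetical upper bound on the coefficient of variation
    """
    stop_words = set()
    for word, per_category in doc_freq.items():
        # Only categories that actually contain the word are considered.
        present = [df for df in per_category.values() if df > 0]
        if len(present) < min_categories:
            continue
        # A low coefficient of variation means the word is spread evenly,
        # so it carries little information about any single category.
        if coefficient_of_variation(present) <= max_cv:
            stop_words.add(word)
    return stop_words


# Toy data (invented): fraction of documents in each class containing the word.
example_doc_freq = {
    "the": {"sports": 0.95, "finance": 0.97, "tech": 0.96},
    "goal": {"sports": 0.60, "finance": 0.05, "tech": 0.02},
    "bond": {"finance": 0.40},
}
example_stop_words = select_stop_words(example_doc_freq, min_categories=3, max_cv=0.1)
print(example_stop_words)
```

With this toy data only "the" is selected: "goal" is unevenly distributed, and "bond" appears in too few categories. Because the thresholds are applied to corpus statistics, a different corpus would yield a different list.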

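The second defect of the chi-square test can be seen in a small Python sketch. It computes the classic chi-square statistic from a 2x2 contingency table and, separately, a standard score of a word's document frequency across categories. The abstract does not give the formula of the improved method, so the sketch does not attempt to combine the two quantities; the function names and the toy counts are assumptions.

```python
from statistics import mean, pstdev


def chi_square(a, b, c, d):
    """Classic chi-square statistic for one (word, category) pair.

    a: documents in the category that contain the word
    b: documents outside the category that contain the word
    c: documents in the category that do not contain the word
    d: documents outside the category that do not contain the word
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def standard_score(df_by_category, category):
    """Distance of the word's document frequency in one category from its
    mean over all categories, in units of the population standard deviation."""
    values = list(df_by_category.values())
    sd = pstdev(values)
    return (df_by_category[category] - mean(values)) / sd if sd else 0.0


# Toy corpus (invented): three categories with 100 documents each.
# The word is rare in "sports" and common in the other two categories.
df = {"sports": 2, "finance": 90, "tech": 88}
docs_per_category = 100

a = df["sports"]                      # 2
b = df["finance"] + df["tech"]        # 178
c = docs_per_category - a             # 98
d = 2 * docs_per_category - b         # 22

chi_sports = chi_square(a, b, c, d)
z_sports = standard_score(df, "sports")
print(chi_sports, z_sports)
```

The chi-square value for "sports" is large because the statistic squares the difference `a * d - b * c` and so ignores its sign, while the standard score is negative, which flags that the word is under-represented in that category.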
Keywords/Search Tags:

text classification, stop words, chi-square test, standard score

Related items

1	The Study Of Comparison Between Mongolian Stop Words And English Stop Words
2	Study On Chinese Text Sentiment Classification
3	The Impact Of Mongolian Stop-List And Stemming On Mongolian Text Categorization
4	Short Text Classification Research Based On Sina Weibo
5	Research On Manifold Learning Based On The Text Classification
6	Research On The Approach Of Micro-blog Text Preprocessing And User Interest Modeling
7	Improvement Of K-means Algorithm And Its Application In The Text Data Cluster
8	Research On Dataless Text Classification With Seed Words: A Supervised Topic Modeling Approach
9	Research And Application Of News Automatic Classification Technology Based On Support Vector Machines
10	The Research On Text Categorization Technology Based On Partial Least Square