
Research On Text Classification Method Based On Improved Feature Selection Algorithm

Posted on: 2019-02-28  Degree: Master  Type: Thesis
Country: China  Candidate: X Fu  Full Text: PDF
GTID: 2438330548954991  Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, unstructured text (news, web pages, e-mail, etc.) is growing at an exponential rate. Efficiently categorizing unstructured text (hereinafter, text categorization) has important theoretical and practical significance for research fields such as information retrieval, recommendation systems, news classification, and spam detection, and has long been a research hotspot at home and abroad. Text categorization is the process of assigning unstructured texts of unknown category to one or more categories under a given classification system. This thesis studies the DFS and TF-IDF algorithms in text classification in depth. The research content and innovations mainly include the following aspects:

(1) To address the insufficient accuracy of the DFS algorithm on unbalanced data sets, a DFS-sCHI algorithm based on the two-sided nature of feature words is proposed to improve the accuracy of text classification. During dimension reduction, the traditional DFS feature selection algorithm does not account for the uneven distribution of samples or for the influence that negative feature words have on a category, which reduces classification accuracy on unbalanced data sets. This thesis addresses these shortcomings by combining DFS with the chi-square (CHI) test, yielding the DFS-sCHI feature selection algorithm. The algorithm filters in two passes. The first pass selects candidate feature words by DFS score. The second pass adds the concept of negative category correlation to the feature words, uses the CHI factor to mark the relevance between feature words and categories, and selects feature words evenly across categories. Experimental results show that on unbalanced data sets DFS-sCHI significantly improves classification accuracy compared with DFS.

(2) To address the TF-IDF algorithm's neglect of category information and of position factors in feature weighting, a TF-pDFS algorithm is proposed to improve the accuracy of text categorization. TF-IDF is a commonly used weighting strategy for text categorization. When evaluating a feature word, it considers only the word's importance to the current document and ignores the intrinsic relationship between feature words and category information. The proposed TF-pDFS algorithm first introduces a DFS factor to measure the degree of relevance between feature words and categories, then adds a DFS adjustment factor to mitigate the impact of unbalanced data on the result. Finally, it analyzes the relationship between the distance factor and the importance of a feature word and introduces a distance adjustment factor. Experimental results show that the TF-pDFS algorithm effectively improves classification accuracy.

(3) Based on the work in this thesis, a prototype system for automatic bird classification is designed and implemented.
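
As a rough illustration of the two standard scores that item (1) combines, the sketch below computes the commonly published DFS score and the chi-square statistic from document counts. It is not the thesis's DFS-sCHI algorithm: the abstract does not give its formulas, so the two-pass filtering, the negative-correlation handling, and the per-category balancing are not reproduced here. The function names and the toy corpus are assumptions made for this example.

```python
from collections import Counter


def term_class_counts(docs, labels):
    """Count, per class, how many documents contain each term."""
    class_sizes = Counter(labels)
    df = {c: Counter() for c in class_sizes}
    for tokens, label in zip(docs, labels):
        df[label].update(set(tokens))  # document frequency, not term frequency
    return class_sizes, df


def dfs_score(term, class_sizes, df):
    """Standard DFS: sum_i P(C_i|t) / (P(not t|C_i) + P(t|not C_i) + 1)."""
    n_docs = sum(class_sizes.values())
    n_with_term = sum(df[c][term] for c in class_sizes)
    if n_with_term == 0:
        return 0.0
    score = 0.0
    for c, n_c in class_sizes.items():
        in_class = df[c][term]
        p_class_given_term = in_class / n_with_term
        p_absent_given_class = 1 - in_class / n_c
        n_other = n_docs - n_c
        p_term_given_other = (n_with_term - in_class) / n_other if n_other else 0.0
        score += p_class_given_term / (p_absent_given_class + p_term_given_other + 1)
    return score


def chi_square(term, cls, class_sizes, df):
    """Standard chi-square statistic between a term and one class."""
    n = sum(class_sizes.values())
    a = df[cls][term]                                  # in class, has term
    b = sum(df[c][term] for c in class_sizes) - a      # other classes, has term
    c_ = class_sizes[cls] - a                          # in class, lacks term
    d = n - a - b - c_                                 # other classes, lacks term
    denom = (a + c_) * (b + d) * (a + b) * (c_ + d)
    return n * (a * d - c_ * b) ** 2 / denom if denom else 0.0


# Toy corpus (hypothetical): two classes, two documents each.
docs = [["goal", "match"], ["goal", "team"], ["vote", "party"], ["vote", "team"]]
labels = ["sport", "sport", "politics", "politics"]
sizes, df = term_class_counts(docs, labels)

for term in ("goal", "team"):
    print(term, dfs_score(term, sizes, df), chi_square(term, "sport", sizes, df))
```

On this toy corpus, "goal" appears only in the "sport" documents and receives the highest score under both measures, while "team" is spread evenly over both classes and receives the lowest. A balanced selection step like the one described in item (1) would then pick top-scoring words per category rather than globally.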
Keywords/Search Tags: Feature selection, feature weighting, neural network, text classification