
Research On Feature Reduction And Semantic Weighting Algorithm Based On N-grams

Posted on: 2016-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: S X Liu
Full Text: PDF
GTID: 2428330548977869
Subject: Software engineering

Abstract/Summary:
Text features are crucial to text categorization: they directly affect both the performance of the classification model and the final test results. Compared with other features, n-grams offer many advantages, but three shortcomings limit their wide application in text categorization: 1) excessively sparse data; 2) feature redundancy; 3) high dimensionality. To overcome these three defects so that n-grams can be applied more effectively to text categorization, this paper proposes a feature reduction algorithm based on the n-gram language model together with a semantic weighting algorithm. First, the algorithm reduces the dimension of the raw n-gram feature set, lowering the cost of handling each individual n-gram feature; it then removes the redundant words within each n-gram feature to complete the reduction; finally, the test or training text is weighted so that absolute zero-valued matches are avoided. Experimental results on the NetEase text corpus show that the proposed algorithm accurately selects high-quality n-gram features from the text while avoiding high dimensionality, redundant words, and sparse data. When classification is performed with a Support Vector Machine (SVM), performance improves markedly over the baseline algorithms.
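The three-stage pipeline described in the abstract (dimension reduction of the n-gram feature set, removal of redundant words inside each n-gram, and weighting that avoids absolute zero-valued matches) can be sketched in Python. Note that the document-frequency threshold, the stopword-based redundancy filter, and the word-overlap similarity below are illustrative stand-ins chosen for this sketch, not the exact formulas proposed in the thesis.

```python
from collections import Counter

STOPWORDS = {"with", "for", "the", "of"}  # assumed stopword list for the sketch

def extract_ngrams(tokens, n):
    """All contiguous word n-grams of a tokenised document."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def reduce_features(docs, n=2, min_df=2):
    """Stage 1: keep only n-grams that occur in at least min_df documents,
    shrinking the raw (high-dimensional, sparse) feature set."""
    df = Counter()
    for tokens in docs:
        df.update(set(extract_ngrams(tokens, n)))
    return {g for g, c in df.items() if c >= min_df}

def strip_redundant(feature):
    """Stage 2: drop redundant (stopword) tokens inside one n-gram feature."""
    return tuple(w for w in feature if w not in STOPWORDS)

def semantic_weight(feature, doc_ngrams):
    """Stage 3: exact match scores 1.0; a near-match earns partial credit
    proportional to word overlap, so an absent feature is not forced to an
    absolute 0 (a simple proxy for semantic-approximation weighting)."""
    if feature in doc_ngrams:
        return 1.0
    fset = set(feature)
    best = max((len(fset & set(g)) / len(fset) for g in doc_ngrams),
               default=0.0)
    return 0.5 * best  # damped partial credit for approximate matches
```

On a toy corpus, `reduce_features` discards every bigram seen in only one document, and `semantic_weight` gives a document containing "deep learning" a small nonzero score for the feature ("machine", "learning") instead of a hard 0, which is the behaviour the abstract attributes to the semantic weighting step.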
Keywords/Search Tags: feature selection, feature weighting, n-grams language model, semantic approximation, redundancy, relevance