
Research And Application Of Feature Selection And Feature Weighting Algorithm Of Text Classification

Posted on: 2018-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: C Gao
GTID: 2428330566467431
Subject: Computer technology
Abstract/Summary:
As colleges and universities become increasingly information-driven, they attach growing importance to building interactive platforms on the campus network. Such a platform provides a complaints window, which not only resolves the problems of teachers and students but also greatly improves the management level of the institution. Classifying the large volume of complaint text, identifying the problems that concern teachers and students, and thereby improving the quality of service has therefore become an urgent task. Based on a study of teacher and student complaint text, this paper introduces the related techniques of text categorization, studies feature selection and feature weighting in detail, and verifies the accuracy of the improved algorithms. Finally, the improved algorithms are applied to complaint text from the campus network interactive platform, and a university complaint text classification system is implemented on the Spark platform. The main work is as follows:

(1) To improve the classical mutual information feature selection algorithm, this paper introduces two factors, the feature frequency and the feature term. The improved algorithm is called the word-frequency-based mutual information feature selection algorithm (Word Mutual Information, WMI). To verify its feasibility and validity, experiments are carried out on Chinese and English data sets. The experimental results show that the WMI algorithm achieves a good classification effect and is an effective feature selection algorithm.

(2) To improve the TF-IDF algorithm, this paper introduces the variance within a category and the variance between categories. The resulting algorithm, TF-IDF-S, addresses the problem that TF-IDF does not consider the intra-class and inter-class distribution of terms. TF-IDF-S is validated on the Chinese and English data sets and compared with other feature weighting algorithms, and the experiments verify its validity.

(3) Based on the above theoretical research, and taking the time factor into account, this paper designs and implements a college and university complaint text classification system based on Spark, applying the improved feature selection and feature weighting algorithms to the complaint text. In particular, the WMI and TF-IDF-S algorithms are parallelized on the Spark platform. The resulting complaint text classification system has good practical value.
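The abstract does not give the formulas for WMI or TF-IDF-S, so the following is only a minimal sketch of the two classical baselines that they are described as improving: pointwise mutual information between a term and a category, and plain TF-IDF weighting. The function names, the token-list document representation, and the toy complaint data are assumptions made for illustration; the frequency and variance corrections of the thesis are not reproduced here.

```python
import math
from collections import Counter


def mutual_information(docs, labels, term, category):
    """Classical pointwise MI of a term and a category (baseline, not WMI).

    MI(t, c) = log( P(t, c) / (P(t) * P(c)) ), with probabilities estimated
    from document counts. Each document is a list of tokens.
    """
    n = len(docs)
    n_t = sum(1 for d in docs if term in d)            # docs containing the term
    n_c = sum(1 for y in labels if y == category)      # docs in the category
    n_tc = sum(1 for d, y in zip(docs, labels)
               if term in d and y == category)         # docs with both
    if n_t == 0 or n_c == 0 or n_tc == 0:
        return float("-inf")                           # term and category never co-occur
    return math.log((n_tc * n) / (n_t * n_c))


def tf_idf(docs):
    """Classical TF-IDF weights (baseline, not TF-IDF-S).

    tf is the relative frequency of the term in the document and
    idf = log(N / df), where df is the number of documents containing the term.
    Returns one {term: weight} dict per document.
    """
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: (c / len(d)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights


# Hypothetical toy corpus of tokenized complaints with made-up categories.
docs = [["network", "slow"], ["network", "down"],
        ["food", "cold"], ["food", "slow"]]
labels = ["it", "it", "canteen", "canteen"]

print(mutual_information(docs, labels, "network", "it"))
print(tf_idf(docs)[0])
```

In this baseline, MI depends only on whether a term occurs in a document, not on how often, and TF-IDF ignores how a term is spread within and across categories. These are the gaps the abstract says WMI and TF-IDF-S are meant to address.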
Keywords/Search Tags:Text Classification, Feature Selection, Term Frequency, Feature Weighting