Research On Text Feature Selection And Classification Algorithms

Posted on: 2020-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: P Q Li
Full Text: PDF
GTID: 2428330590971613
Subject: Electronic and communication engineering
Abstract/Summary:
With the continuing development of Internet information technology in the 21st century, text classification has become an active research topic as an effective means of managing text data. However, most text data is unstructured and noisy, which leads to a large number of document features and an uneven feature distribution; both seriously reduce the accuracy and efficiency of text classification. Building on previous work, this thesis studies feature selection and classification algorithms for text categorization and proposes two methods to address these problems: an improved mutual information feature selection model, and a KNN classification algorithm based on K-center points and rough set theory (K Center Point and Rough Set KNN, KRS-KNN).

The traditional mutual information feature selection algorithm does not consider the frequency, part of speech, or distribution of feature words. On the basis of the traditional mutual information model, three indicators are therefore combined (intra-class feature frequency, feature coverage rate, and a part-of-speech coefficient) to construct a new mutual information evaluation function. The selected features are then vectorized with the vector space model, and the resulting text feature set is classified with the KNN classification model. Finally, the algorithm is verified by experiments. The experimental results show that the algorithm has a significant effect on feature selection and improves the accuracy of text classification. Compared with the traditional mutual information model, the recall and F1 values of the classification are also improved, which supports the validity and feasibility of applying the algorithm to text classification.

In text categorization, high feature dimensionality and heavy computation lead to low classification efficiency. A KNN classification algorithm based on K-center points and rough sets is therefore introduced. First, based on the K-center algorithm, the method groups the text data set into clusters, calculates the cost function value between the cluster center and every other text in each cluster, and uses a threshold to cull the samples with higher cost values, which reduces the size of the text collection and the amount of calculation. Then, using rough set theory, samples whose category has already been determined are not judged again, and only the uncertain samples are classified by the KNN classification algorithm. Finally, the effectiveness of the algorithm is verified by experiments. The results show that, with the help of the K-center cost function and rough sets, the algorithm can effectively eliminate useless text data, reduce the computational scale of the text collection, greatly reduce data processing time, and improve the classification efficiency of text data.
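
For orientation, the traditional mutual information model that the abstract takes as its baseline can be sketched as below. This is only a minimal illustration under stated assumptions: it uses the common document-frequency estimate MI(t, c) = log(P(t, c) / (P(t) P(c))) and keeps the maximum over classes as the score of a term. The function names `mutual_information_scores` and `select_features` are hypothetical, and the thesis's improved evaluation function (intra-class feature frequency, feature coverage rate, part-of-speech coefficient) is not reproduced because the abstract does not give its form.

```python
import math
from collections import Counter, defaultdict


def mutual_information_scores(docs, labels):
    """Score each term by its maximum pointwise MI over all classes.

    docs   : list of token lists (documents are assumed to be tokenized already)
    labels : list of class labels, one per document
    """
    n_docs = len(docs)
    class_count = Counter(labels)        # number of documents in each class
    term_count = Counter()               # number of documents containing each term
    joint_count = defaultdict(Counter)   # joint_count[term][cls]: docs of cls containing term

    for tokens, cls in zip(docs, labels):
        for term in set(tokens):         # presence per document, not raw term frequency
            term_count[term] += 1
            joint_count[term][cls] += 1

    scores = {}
    for term, per_class in joint_count.items():
        best = -math.inf
        for cls, a in per_class.items():
            # MI(t, c) = log( P(t, c) / (P(t) * P(c)) ), estimated from document counts
            mi = math.log(a * n_docs / (term_count[term] * class_count[cls]))
            best = max(best, mi)
        scores[term] = best
    return scores


def select_features(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = mutual_information_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy corpus, invented purely for illustration.
docs = [
    ["goal", "match", "the"],
    ["match", "team", "the"],
    ["stock", "market", "the"],
    ["market", "price", "the"],
]
labels = ["sport", "sport", "finance", "finance"]
scores = mutual_information_scores(docs, labels)
```

On this toy corpus, "goal" (present in one sport document) and "match" (present in both) receive the same score. This is the kind of insensitivity to feature frequency and distribution that the abstract attributes to the traditional model.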
Keywords/Search Tags: Text classification, mutual information, feature selection, rough set, K center