| With the rapid development of information and Internet technology, the volume of text information is growing exponentially. Handling such large and rapidly growing data has become a challenge for information science and technology. Text classification, a key technology for organizing and managing massive text data, addresses many of these information-handling problems and helps users quickly retrieve, query, filter, and use information. This paper studies text classification and its related technologies. It examines each step of the text classification process: text preprocessing, text feature extraction, term weighting, classification, and performance evaluation. The term weighting algorithm and the classification algorithm are the key issues in text classification: term weighting strongly affects the precision of the classification result, while the classification algorithm directly affects both efficiency and accuracy. This paper therefore focuses on these two problems. Its research content and contributions are as follows. First, this paper studies and improves the TFIDF algorithm. Term weighting has a significant influence on classification results, and TFIDF is one of the most popular weighting algorithms in the vector space model (VSM). The traditional TFIDF algorithm ignores both the semantic relations between a term and other feature words and the distribution of terms within and between the categories of a text dataset. To address this, the paper introduces semantic relations and, building on information entropy and information gain, proposes an improved TFIDF algorithm (S-TFIDFIGE) that combines semantics with information entropy and information gain. Second, this paper studies and improves the KNN algorithm. KNN is widely used for text classification because of its simple implementation and high accuracy. However, its high computational complexity and low efficiency limit its application to large volumes of text. MapReduce, a distributed parallel computation model, is general-purpose and scalable and can process massive data effectively. After analyzing the characteristics of the KNN algorithm and the advantages of the Hadoop MapReduce programming model, this paper proposes a PKNN algorithm based on MapReduce parallel computation. Finally, a series of experiments is designed and implemented to verify the feasibility and validity of the improved S-TFIDFIGE and PKNN algorithms, and the two algorithms are then combined for text classification. Combining the S-TFIDFIGE and PKNN algorithms proposed in this paper not only speeds up classification but also improves its accuracy, and the combined approach can be applied to classifying large amounts of text. |
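
As a point of reference for the two components discussed above, the following is a minimal Python sketch of the classical baseline only: standard TF-IDF weighting and a KNN vote whose neighbor search is split into a map step (local top-k per data partition) and a reduce step (global merge and majority vote). It is not the S-TFIDFIGE or PKNN algorithm proposed in the paper, and it runs in a single process rather than on Hadoop. All function names, the toy documents, and the partitioning are assumptions made for illustration.

```python
import math
from collections import Counter
from heapq import nlargest


def fit_idf(docs):
    """Classical IDF over tokenized training documents: log(N / df(t))."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def vectorize(doc, idf):
    """Sparse TF-IDF vector; terms unseen during training are dropped."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def map_partition(query, partition, k):
    """Map step: the k most similar (similarity, label) pairs in one partition."""
    return nlargest(k, ((cosine(query, vec), label) for vec, label in partition))


def reduce_vote(local_results, k):
    """Reduce step: merge local candidates, keep the global top k, majority vote."""
    top = nlargest(k, (pair for result in local_results for pair in result))
    votes = Counter(label for _, label in top)
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, already tokenized.
train = [
    (["goal", "match", "team"], "sports"),
    (["team", "coach", "score"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "market"], "finance"),
]
idf = fit_idf([doc for doc, _ in train])
vectors = [(vectorize(doc, idf), label) for doc, label in train]

# Two partitions stand in for the input splits a MapReduce job would process.
partitions = [vectors[:2], vectors[2:]]
query_vec = vectorize(["team", "score", "goal"], idf)

k = 3
local = [map_partition(query_vec, part, k) for part in partitions]
predicted = reduce_vote(local, k)
print(predicted)
```

Because each partition returns at most k candidates, the reduce step merges only a small list per partition, which is the general reason a KNN search parallelizes well under a map/reduce split. The paper's actual job design may differ.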