
Text Classification Algorithms for Massive Amounts of Text

Posted on: 2017-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: T Yang
Full Text: PDF
GTID: 2348330491957960
Subject: Computer technology
Abstract/Summary:
With the rapid spread of personal computers and the rapid growth of mobile communication services in recent years, massive amounts of text data have been generated, and not only on the Internet. This data contains much useful information and makes people's lives more convenient, but it is disorganized and growing fast, so much of that information cannot be mined. Text categorization is well suited to such cluttered data and is the foundation of data mining and information retrieval. Some existing text classification methods classify by phrases and the document title; they are simple and fast, but their accuracy is poor. Other methods work well on small amounts of text, but on massive data their effectiveness declines exponentially as the data grows. To address text classification on massive data, this thesis follows the basic process of text categorization and modifies the key algorithms in it. It uses the open-source distributed computing platform Hadoop to implement the text categorization algorithms in parallel, improving the efficiency of classifying massive text data while maintaining accuracy.

First, this thesis briefly introduces the basic process of text classification and then describes its key techniques in detail: feature extraction, text representation, text feature selection, and the text categorization algorithm. For feature extraction, it mainly studies the TF-IDF algorithm based on keyword weights. After reviewing current research on TF-IDF in China and abroad, it identifies several deficiencies of the algorithm, addresses them, and proposes a new TF-IDF formula. For classification, it analyzes existing text classification algorithms and, to improve both efficiency and accuracy, proposes a new text classification algorithm based on rough sets and correlation analysis. Finally, the whole process is implemented as distributed parallel computation on the Hadoop platform, and experiments show that it meets the requirements of massive-data text classification, improving efficiency while maintaining accuracy.
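
For orientation, the standard TF-IDF weighting that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the textbook formula (term frequency times the logarithm of inverse document frequency), not the improved formula proposed in the thesis, which the abstract does not state. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Textbook TF-IDF. docs is a list of token lists.

    Returns one {term: weight} dict per document. This is the standard
    formula, assumed here for illustration; it is not the thesis's variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # tf = relative frequency in this document; idf = log(N / df).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample corpus, already tokenized.
docs = [
    ["hadoop", "text", "classification"],
    ["text", "mining"],
    ["hadoop", "cluster"],
]
weights = tf_idf(docs)
```

In this sketch a term that occurs in every document receives weight zero, and terms concentrated in few documents are weighted more heavily. The per-document loop is independent across documents, which is the kind of computation that lends itself to a distributed map step on a platform such as Hadoop.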
Keywords/Search Tags:Massive amounts of text, Text categorization, Hadoop, Distributed computing