Text Classification Algorithms For Massive Amounts Of Text

Posted on:2017-01-03

Degree:Master

Type:Thesis

Country:China

Candidate:T Yang

Full Text:PDF

GTID:2348330491957960

Subject:Computer technology

Abstract/Summary:

With the rapid spread of personal computers and the fast growth of mobile communication services in recent years, massive amounts of text data have been generated, on the Internet and beyond. These data contain much useful information and make people's lives more convenient, but they are disorganized and growing fast, so much of that useful information cannot be mined. Text categorization, a foundation of data mining and information retrieval, is well suited to handling such cluttered data. Some existing text classification methods classify documents by phrases and the document title; they are simple and fast, but their accuracy is poor. Other methods work well on small amounts of text, but their effectiveness decreases exponentially as the volume of data grows.

To classify massive text data, this thesis follows the basic process of text categorization and modifies the key algorithms within it. It uses the open-source distributed computing platform Hadoop to implement the text categorization algorithms in parallel, improving the efficiency of classifying massive text data while preserving accuracy.

The thesis first briefly introduces the basic process of text classification and then describes its key techniques in detail: feature extraction, text representation, text feature selection, and the text categorization algorithm. For feature extraction, it focuses on the TFIDF algorithm based on keyword weights. A review of current domestic and international research on TFIDF reveals several deficiencies in the algorithm; the thesis addresses these and proposes a new TFIDF formula. For the classification step, it analyzes existing text classification algorithms and, to improve both efficiency and accuracy, proposes a new text classification algorithm based on rough sets and correlation analysis. Finally, the whole process is implemented as a distributed parallel computation on the Hadoop platform, and experiments show that it meets the requirements of massive text classification, improving efficiency while preserving accuracy.
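
For orientation, the classic TFIDF weighting that the thesis takes as its starting point can be sketched as follows. This is a minimal illustration of the standard textbook formula (term frequency times the logarithm of inverse document frequency), not the improved formula proposed in the thesis, which the abstract does not state. The function name `tf_idf` and the sample documents are assumptions made for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (classic TF-IDF).

    Illustrative only: this is the standard formula, not the thesis's variant.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # tf = relative frequency in this document; idf = log(N / df).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical sample corpus, already tokenized.
docs = [
    ["hadoop", "text", "classification"],
    ["text", "mining"],
    ["hadoop", "cluster"],
]
weights = tf_idf(docs)
```

In this sketch a term that appears in only one document, such as "classification", receives a higher weight than a term shared across documents, such as "text". On Hadoop, the document-frequency count would typically be computed in a separate map and reduce pass; that decomposition is likewise an assumption here, not a detail given in the abstract.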

Keywords/Search Tags:

Massive amounts of text, Text categorization, Hadoop, Distributed computing

Related items

1	Research Of Distributed Text Categorization Based On Hadoop
2	An Implementation Of Text Categorization System Based On Hadoop
3	Research On The Hadoop-based Distributed Full-text Retrieval And Related Technologies
4	Research And Application Of Text Mining Based On Hadoop
5	Research On Parallelization Of Text Clustering Based On Hadoop
6	Research And Implementation Of Automatic Text Classification Based On Hadoop
7	Study And Implementation Of The Text Categorization Of Electricity Goods Based On Hadoop
8	Research And Implementation Of Chinese Text Classification Based On Hadoop And SVM Algorithm
9	Research On Text Classification Method Based On Hadoop
10	Research On The Parallelization Of Text Categorization Based On Convolution Neural Network