
The Research Of MapReduce Implementing Of Text Classification KNN Algorithm Based On Mass Data

Posted on: 2016-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: C X Han
GTID: 2348330542975452
Subject: Computer Science and Technology
Abstract/Summary:
With the development of information technology, the volume of information data has grown significantly. To organize and manage such large data sets, researchers have proposed many data mining techniques, including content-based data mining. The k-nearest neighbor (KNN) algorithm is one of the most popular content-based data mining algorithms and is widely used to classify documents. Although it performs well in many cases, KNN hits a bottleneck when the amount of data to be processed is large. In recent years, the maturing of Hadoop technology has provided technical support for solving this problem in text classification. To overcome the inefficiency of the existing algorithm, we propose a new distributed text classification algorithm based on KNN and MapReduce.

In this thesis we first introduce text preprocessing techniques, feature vector extraction methods, the vector space model for documents, and traditional text classification algorithms. We then review HDFS, the Hadoop distributed file system, and the MapReduce programming model. After that, we analyze the existing KNN classification algorithm and propose a method for improving it by combining it with MapReduce. Finally, we evaluate the new algorithm experimentally on the Newsgroup-18828 dataset. The experimental results show that our algorithm significantly improves on the existing KNN algorithm.

The thesis studies in depth the key technologies of text classification and the characteristics of the KNN algorithm, and implements a parallel KNN text classification algorithm based on the MapReduce programming model. Using experimental data from running KNN on a single machine and on a cluster, the thesis shows that parallelization on a cloud platform can reduce the computing time for large-scale data, and it analyzes the influence of two major performance parameters on job running time. The thesis builds a five-node Hadoop cluster and designs four experiment plans. The results show that: 1) when effective computation accounts for only a small proportion of the total running time, the advantages of a small cluster do not show; 2) the MapReduce-based KNN text classification algorithm achieves good speed-up; 3) when the Map tasks generate little intermediate data, increasing the Map task memory buffer size is not a worthwhile optimization; 4) a single-node failure seriously degrades job performance, especially when the cluster is small.
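
The abstract does not spell out how the KNN computation is divided between the Map and Reduce phases. The following is a minimal single-process sketch of one common arrangement, assuming the training set is split across Map tasks, each Map task emits its local top-k neighbors for every test document, and the Reduce step merges them and takes a majority vote. The function names (`vectorize`, `cosine`, `map_phase`, `reduce_phase`, `classify`), the plain term-frequency vectors, and the cosine similarity measure are all illustrative assumptions, not the thesis's actual implementation.

```python
import heapq
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term-frequency vector (assumed stand-in for real feature extraction)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_phase(train_split, test_vectors, k):
    """Map: for one split of the training set, emit (test_id, (similarity, label))
    for the k most similar training documents in this split only."""
    for test_id, test_vec in test_vectors.items():
        scored = ((cosine(test_vec, vec), label) for vec, label in train_split)
        for pair in heapq.nlargest(k, scored):
            yield test_id, pair


def reduce_phase(candidates, k):
    """Reduce: merge candidates from all splits, keep the global top-k,
    and return the majority label among them."""
    top = heapq.nlargest(k, candidates)
    votes = Counter(label for _, label in top)
    return votes.most_common(1)[0][0]


def classify(train_splits, test_docs, k=3):
    """Run the map, shuffle, and reduce steps in one process."""
    test_vectors = {tid: vectorize(text) for tid, text in test_docs.items()}
    shuffled = defaultdict(list)  # simulates the shuffle: group values by test_id
    for split in train_splits:
        vectors = [(vectorize(text), label) for text, label in split]
        for test_id, pair in map_phase(vectors, test_vectors, k):
            shuffled[test_id].append(pair)
    return {tid: reduce_phase(pairs, k) for tid, pairs in shuffled.items()}
```

Because each Map task forwards at most k candidates per test document, the intermediate data stays small relative to the training set, which is one plausible reason a larger Map-side buffer would bring little benefit in this kind of job.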
Keywords/Search Tags: Hadoop, text classification, KNN, MapReduce