The Research Of KNN Classification Algorithm For Mass Text Based On MapReduce

Posted on:2018-02-18

Degree:Master

Type:Thesis

Country:China

Candidate:Y X Wang

Full Text:PDF

GTID:2428330518455131

Subject:Computer technology

Abstract/Summary:

In recent years, the rapid development of Internet technology has caused the volume of data generated online to grow steadily, and processing this massive data has become a serious problem. Most of the data takes the form of text; it is large in volume and loosely structured, so the key question is how to extract the information people are interested in. Among the many classification algorithms, KNN has become one of the most widely used because it is simple to implement, accurate, and stable. However, when the training set is too large or contains too many feature words, the computational efficiency of KNN drops sharply. First, massive data increases the computational complexity of text similarity. Second, the sheer volume of text greatly reduces classification efficiency.

This thesis therefore studies the design and implementation of a parallel KNN classification algorithm under the MapReduce framework and proposes MKNN, a classification algorithm based on critical-value threshold partitioning. MKNN partitions texts around center points to address the loss of efficiency that traditional KNN text classification suffers in large-scale text processing. In the preprocessing phase, MKNN searches for center points among the texts of the sample data set and obtains a set of center points. When a text to be classified arrives, MKNN only needs to compute its similarity against the data in the center point set. The thesis analyzes the cosine-based calculation of similarity between texts and improves classification efficiency by using MapReduce's distributed programming model to process the key-value pairs of the text similarity calculation.

The thesis studies text categorization technology and parallel similarity calculation, focusing on the parallelization of the center point partitioning algorithm in the preprocessing phase and of the cosine similarity calculation. Finally, the algorithm is analyzed and compared through experiments. The results show that MKNN has good parallel scalability on large-scale data and that its classification performance improves noticeably at comparable data sizes, preserving the accuracy of KNN classification while improving its efficiency.
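
The abstract does not give the details of MKNN, so the following is only a minimal single-process sketch of the general idea it describes: a map step emits key-value pairs, a reduce step builds one center point per class, and a new text is compared by cosine similarity against the center points only. Every name here (tokenize, map_phase, reduce_phase, classify) is hypothetical, the whitespace tokenizer and the one-center-per-class rule are assumptions, and a nearest-center decision stands in for the thesis's actual threshold partitioning and neighbor selection.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately naive assumption)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def map_phase(labeled_docs):
    """Map: emit one (label, term-frequency vector) pair per training text."""
    for label, text in labeled_docs:
        yield label, Counter(tokenize(text))


def reduce_phase(pairs):
    """Reduce: sum the vectors sharing a label into one center point.

    The sum is not divided by the class size because cosine similarity
    is unaffected by the length of a vector.
    """
    centers = defaultdict(Counter)
    for label, vector in pairs:
        centers[label].update(vector)
    return dict(centers)


def classify(text, centers):
    """Return the label of the center point most similar to the text."""
    query = Counter(tokenize(text))
    return max(centers, key=lambda label: cosine(query, centers[label]))


if __name__ == "__main__":
    # Toy training data, invented purely for illustration.
    train = [
        ("sports", "football match goal team"),
        ("sports", "team wins match"),
        ("tech", "python code compiler"),
        ("tech", "code runs on cluster"),
    ]
    centers = reduce_phase(map_phase(train))
    print(classify("the team scored a goal", centers))
```

Comparing a query against a handful of center points instead of every training text is what removes most of the similarity computations. On a real Hadoop cluster the map and reduce functions would run as distributed tasks over partitioned input rather than as in-memory generators.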

Keywords/Search Tags:

KNN, Data Partition, Text similarity, MapReduce

Related items

1	Design And Implementation Of Similarity Self - Connection Algorithm For Massive Data Sets Based On MapReduce
2	Research On Complex Distance Measure Based MapReduce Similarity Join Techniques
3	Research And Implementation Of Data Placement And Query Techniques Based On MapReduce In Distributed Multi-Dimensional Data Warehouse
4	The Research And Implementation Of Comprehensive Mapreduce
5	Research On The Clustering Algorithm Of Parallel Partition Based On MapReduce
6	Techniques Of Partition And Query In Data Warehouses Based On Hadoop
7	Research On Partition Selection Strategy For Big Data Management Based On KNN Connection Processing
8	Research And Implementation Of The Text Cluster Based On Text Similarity Calculation
9	The Research Of Big Data Text Classification Method Based On Mapreduce
10	Key Value Based Algorithm For Solving Reduce Load Imbalance In Mapreduce