
Research on a KNN Classification Algorithm for Massive Text Based on MapReduce

Posted on: 2018-02-18  Degree: Master  Type: Thesis
Country: China  Candidate: Y X Wang  Full Text: PDF
GTID: 2428330518455131  Subject: Computer technology
Abstract/Summary:
In recent years, with the rapid development of Internet technology, the amount of data generated on the Internet has kept growing, and processing massive data has become a serious problem. These data are generally expressed as text; they are large in volume and loosely structured, so extracting the information people are interested in has become the key issue. Among the many classification algorithms, KNN has become one of the most widely used because of its simple implementation, accurate classification, and high stability. However, when the training set is too large or contains too many feature words, the computational efficiency of the KNN algorithm drops sharply. First, massive data increases the computational cost of text similarity. Second, the sheer number of texts greatly reduces the efficiency of text classification.

Therefore, this paper mainly studies the design and implementation of a parallel KNN classification algorithm under the MapReduce framework, and proposes a classification algorithm based on critical-value threshold partitioning, called the MKNN algorithm. MKNN partitions texts around center points and targets the loss of classification efficiency that the traditional KNN text classification algorithm suffers in large-scale text processing. In the preprocessing phase, MKNN searches for center points among the texts in the sample data set and obtains a set of center points. When a text to be classified arrives, MKNN only needs to compute its similarity to the data in the center point set. The paper analyzes the cosine theorem used to calculate the similarity between texts, and improves the efficiency of text classification by using the distributed programming advantages of MapReduce to process the key-value pairs of the text similarity calculation.

The paper studies text categorization technology and parallel similarity calculation, and mainly analyzes the parallelization of the center point partitioning algorithm in the preprocessing stage and of the cosine-theorem similarity calculation. Finally, the experimental results are analyzed and compared. They show that the MKNN algorithm has good parallel scalability when dealing with large-scale data, and that its classification performance improves noticeably for experimental data of similar size. The algorithm thus preserves the accuracy of KNN classification while improving classification efficiency.
Keywords/Search Tags:KNN, Data Partition, Text similarity, MapReduce
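
The center point idea described in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation: the abstract does not say how center points or the threshold are computed, so this sketch assumes one center per class (the mean of that class's term-weight vectors) and keeps the `m` most similar centers instead of applying a threshold. The names `cosine`, `build_centers`, `classify`, `k`, and `m` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_centers(train):
    """Preprocessing. Assumption: one center per class, the mean vector."""
    sums = defaultdict(Counter)
    counts = Counter()
    for vec, label in train:
        sums[label].update(vec)  # adds the term weights element-wise
        counts[label] += 1
    return {
        label: {t: w / counts[label] for t, w in total.items()}
        for label, total in sums.items()
    }


def classify(query, train, centers, k=3, m=2):
    """Compare with the centers first, then run KNN in the kept partitions."""
    # Step 1: only len(centers) similarity computations.
    nearest = sorted(
        centers, key=lambda c: cosine(query, centers[c]), reverse=True
    )[:m]
    keep = set(nearest)
    # Step 2: ordinary KNN, restricted to the selected partitions.
    scored = [
        (cosine(query, vec), label) for vec, label in train if label in keep
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy training set: (term-weight vector, class label).
train = [
    ({"ball": 2, "team": 1}, "sports"),
    ({"ball": 1, "goal": 2}, "sports"),
    ({"stock": 2, "market": 1}, "finance"),
    ({"market": 2, "bank": 1}, "finance"),
    ({"cpu": 2, "code": 1}, "tech"),
    ({"code": 2, "bug": 1}, "tech"),
]
centers = build_centers(train)
print(classify({"goal": 1, "ball": 1}, train, centers))
```

In a MapReduce setting, the per-document similarity in step 2 would plausibly be emitted by mappers as key-value pairs and the top-`k` vote taken in a reducer. That mapping is an assumption here, since the abstract gives no further detail.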