
Parallelized Text Classification Algorithm Research

Posted on: 2018-10-24  Degree: Master  Type: Thesis
Country: China  Candidate: P P Yu  Full Text: PDF
GTID: 2358330515957141  Subject: Computer Science and Technology
Abstract/Summary:
Text classification is a key technology in text processing and is widely used in information retrieval, content filtering, and topic modeling. However, as the volume of text data keeps growing, the drawbacks of traditional text classification, such as low efficiency and low accuracy, are becoming increasingly prominent. In particular, traditional methods cannot meet the processing requirements of big data, and a centralized data processing architecture is no longer suitable for processing and storing data at that scale. Parallel distributed computing frameworks have emerged to address these issues. Building on the maturity of such frameworks and on a study of the fundamental theory of text classification, this thesis aims to improve the performance of the K-Nearest Neighbors (KNN) algorithm for text classification and implements parallelized text classification on the Spark computing framework. The main work is as follows.

First, the related technology of text classification and the fundamentals of parallel distributed computing are studied and summarized. The thesis introduces the basic concepts and key techniques of text classification, and then explains in detail the architecture of the Spark computing framework and its core abstraction, the RDD (Resilient Distributed Dataset). Together these form the theoretical basis for the study of parallel text classification algorithms.

Second, the study of KNN text classification shows that, on big data, the algorithm suffers from expensive similarity computation, high redundancy, and slow processing speed. The thesis therefore proposes a high-efficiency KNN classification algorithm based on the Spark framework and clustering. The training set is first cut twice by an optimized K-medoids algorithm that introduces a constriction factor. The value of K is then updated iteratively during classification, and the classification result is produced along with it. At the same time, the data is partitioned and processed iteratively, using Spark's in-memory computation to achieve parallelization.

Third, the parallelization experiments show that the partitioning step affects the accuracy of the KNN text classification results. The thesis therefore proposes a second high-efficiency KNN classification algorithm, based on the Spark framework and word relatedness. During parallelization, this algorithm establishes a new distance mechanism that incorporates word relatedness to improve the similarity calculation. The algorithm improves both the efficiency and the accuracy of KNN text classification.

In conclusion, this thesis concentrates on the optimization and application of the KNN text classification algorithm on the Spark framework, based on an analysis of the basic theory of text classification and parallel distributed computing. The experimental results show that the high-efficiency Spark-based KNN classification algorithms designed and implemented in this thesis help resolve the low efficiency and low accuracy of classification on big data, improve the efficiency and accuracy of KNN text classification, and perform effectively on large-scale text datasets.
Keywords/Search Tags:Text Classification, KNN, Parallel Distributed Computing, Spark Framework, Word Relatedness
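
The partition-then-merge idea behind parallel KNN classification can be sketched roughly as follows. This is not the thesis's implementation: it is a minimal illustration in plain Python, with ordinary lists standing in for Spark partitions, plain cosine similarity over term counts in place of the word-relatedness distance, and no K-medoids pruning of the training set. All names (`cosine`, `local_top_k`, `knn_classify`) and the toy training data are hypothetical.

```python
import heapq
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def local_top_k(partition, query, k):
    """Per-partition step: the k training documents most similar to the query."""
    scored = ((cosine(vec, query), label) for vec, label in partition)
    return heapq.nlargest(k, scored)


def knn_classify(partitions, text, k=3):
    """Merge the local candidates, keep the global top k, and take a majority vote."""
    query = Counter(text.lower().split())
    # In a distributed setting, each partition would be scored on a separate worker.
    candidates = [c for part in partitions for c in local_top_k(part, query, k)]
    top = heapq.nlargest(k, candidates)
    votes = Counter(label for _, label in top)
    return votes.most_common(1)[0][0]


# Toy training set (assumed data, for illustration only).
train = [
    ("team won football match", "sports"),
    ("striker scored late goal", "sports"),
    ("coach praised goalkeeper after match", "sports"),
    ("central bank raised interest rates", "finance"),
    ("stock markets fell as rates rose", "finance"),
    ("investors sold bank shares", "finance"),
]
vectors = [(Counter(text.split()), label) for text, label in train]
partitions = [vectors[0::2], vectors[1::2]]  # two stand-in "partitions"

print(knn_classify(partitions, "bank cut interest rates"))
print(knn_classify(partitions, "goalkeeper saved goal in football match"))
```

Because every partition returns its own k best candidates, the merged global top k in this sketch matches what a single unpartitioned scan would select, so partitioning here changes only where the similarity computation happens.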