
Design And Implementation Of Text Classification System Based On Hadoop Platform

Posted on: 2019-09-17
Degree: Master
Type: Thesis
Country: China
Candidate: K Liu
GTID: 2428330596460601
Subject: Electronics and Communications Engineering
Abstract/Summary:
Over the last decade or so, great progress has been made in information technology and computer technology. The total amount of information in the world has grown exponentially, and traditional technologies can no longer extract valuable information from such vast amounts of data. Text categorization technology assigns the texts to be classified to predefined categories, which makes massive text data easier to manage. There are many text classification algorithms, and the classical KNN classification algorithm is one of the most frequently used. Its principle is very simple and its classification accuracy is relatively high. However, the classical KNN classification algorithm has high computational complexity, long running time, and high hardware requirements. The SKNN classification algorithm improves on it, reducing complexity while maintaining high classification accuracy. The number of texts in each category of the training text library may be unbalanced. For this case, and in order to further reduce the computational complexity, this paper proposes an improved SKNN classification algorithm and demonstrates its advantages experimentally.

Big data has become a hot topic in computer science. Processing massive data with traditional stand-alone methods runs into many problems that are difficult to overcome: not only is it difficult to store such a large amount of data, it is also difficult to guarantee computation time and accuracy. The Hadoop platform is an open source cloud computing platform on which users can store and process big data. This paper mainly studies the design and implementation of a text classification system based on the Hadoop platform.

Text preprocessing is a key step in text classification technology. A single text file is a small file, and storing a large number of small files directly in the HDFS distributed file system causes a huge waste of resources. Based on the way HDFS stores data and on the characteristics of text data, this paper designs an efficient text data storage method. English text only needs to be segmented on word boundaries, whereas Chinese word segmentation is more complicated. This paper analyzes Chinese word segmentation, feature selection, and the other steps involved in text preprocessing. To make full use of the Hadoop platform's advantages for big data processing, this paper designs a text preprocessing program based on the MapReduce programming model that processes text data sets in parallel.

The design of the text classification algorithm is another key step in text classification technology. In view of the shortcomings of the classical KNN classification algorithm and the SKNN classification algorithm, this paper proposes an improved SKNN classification algorithm based on cutting subsets. The algorithm cuts each category in the training text library into S subsets. For a text to be classified, it first computes the K subsets closest to that text among all subsets, then finds the K closest texts within those K subsets, and finally determines the category of the text from the categories of those K texts. Several algorithms were tested in stand-alone and Hadoop cluster environments. Compared with the classical KNN classification algorithm, the improved SKNN classification algorithm has lower computational complexity. Compared with the SKNN classification algorithm, the improved SKNN classification algorithm achieves higher classification accuracy when the number of texts in each category of the training text library is unbalanced.

This paper analyzes the principles of text classification, introduces the components of the Hadoop platform and the basic working principle of each, builds a pseudo-distributed cluster using virtual machines, studies parallel text classification, and designs and implements a text classification system based on the Hadoop platform. Experimental results show that the improved SKNN classification algorithm on the Hadoop platform is an effective text classification algorithm. With the maturation of big data technology and the wider adoption of the Hadoop platform, research on the parallelization of text classification can be expected to achieve even greater results. Further research can be carried out on the classification algorithms themselves, for example by studying naive Bayes and support vector machines.
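
The subset-cutting classification steps summarized in the abstract (cut each category into S subsets, find the K nearest subsets, then the K nearest texts within them, then vote) can be illustrated with a minimal single-machine sketch. It is not the thesis implementation. The abstract does not say how a category is cut or how the distance to a subset is measured, so the sketch assumes contiguous chunks, a centroid as the subset representative, and Euclidean distance over numeric feature vectors. All function names are hypothetical.

```python
import math
from collections import Counter


def distance(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean, used here as the assumed subset representative."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cut_into_subsets(training, s):
    """Cut each category into at most s contiguous subsets.

    training maps a category label to a list of feature vectors.
    Returns a list of (label, centroid, members) tuples.
    """
    subsets = []
    for label, vectors in training.items():
        if not vectors:
            continue  # skip empty categories
        parts = min(s, len(vectors))
        size = math.ceil(len(vectors) / parts)
        for start in range(0, len(vectors), size):
            members = vectors[start:start + size]
            subsets.append((label, centroid(members), members))
    return subsets


def classify(query, subsets, k):
    """Pick the k nearest subsets, then vote among the k nearest texts in them."""
    nearest_subsets = sorted(subsets, key=lambda sub: distance(query, sub[1]))[:k]
    candidates = [
        (distance(query, vector), label)
        for label, _, members in nearest_subsets
        for vector in members
    ]
    candidates.sort(key=lambda item: item[0])
    votes = Counter(label for _, label in candidates[:k])
    return votes.most_common(1)[0][0]


# Toy, deliberately unbalanced training library (made-up vectors).
training = {
    "sports": [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [1.0, 0.1]],
    "finance": [[0.0, 1.0], [0.1, 0.9]],
}
subsets = cut_into_subsets(training, s=2)
print(classify([0.95, 0.05], subsets, k=3))
print(classify([0.05, 0.95], subsets, k=3))
```

In this sketch the saving over classical KNN comes from comparing the query with one representative per subset first, so full text-to-text distances are only computed inside the K selected subsets.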
Keywords/Search Tags: Text Categorization, Big Data, Hadoop platform, KNN, MapReduce