Research And Implementation Of Chinese Text Classification Based On Hadoop And SVM Algorithm

Posted on: 2016-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: L L Zhang
GTID: 2208330470970573
Subject: Electronic and communication engineering
Abstract/Summary:
The growth of the Internet and the explosive increase in available information give us a wealth of material but also create difficulties, so extracting valuable information quickly and accurately from massive amounts of data has become very important. Text classification is a foundation of data mining and a prerequisite for effectively and accurately mining valuable information from large volumes of text, which makes fast and accurate classification of large text collections a critical problem in the field.

First, this thesis introduces the development and applications of text categorization and Hadoop, studies in depth the two core components of Hadoop, the distributed file system and the distributed computing framework, and analyzes their working mechanisms. Second, it examines the process and key technologies of text classification, compares Chinese text classification with text classification for other languages in terms of several key technologies, selects the SVM algorithm as the object of study, and analyzes the underlying SVM theory. Third, combining Hadoop as the big data processing platform with text categorization theory, it implements text preprocessing, feature selection, weight calculation, and a parallel SVM algorithm in the MapReduce framework.

To study how SVM-based Chinese text classification performs on Hadoop, we set up a small Hadoop cluster. On this platform, experiments on Chinese text classification with SVM are used to analyze training time and classification accuracy. The experimental data show that training a traditional support vector machine is both time-consuming and demanding of computing resources; if the data set is too large, training may cause errors or even crash the machine. The parallel SVM algorithm implemented in this thesis on the Hadoop platform can reduce training time and improve the accuracy of text classification, especially for large text collections.
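The abstract names weight calculation in the MapReduce framework but gives no details. As an illustration only, the following is a minimal single-process sketch of how term weights could be computed in a map/shuffle/reduce style. It assumes TF-IDF as the weighting scheme and documents that are already word-segmented; neither is stated in the abstract. The names `map_term_counts`, `shuffle`, `reduce_sum`, and `tf_idf` are hypothetical and do not come from the thesis or from any Hadoop API.

```python
import math
from collections import defaultdict


def map_term_counts(doc_id, tokens):
    """Map step: emit ((term, doc_id), 1) for each token of a segmented document."""
    for term in tokens:
        yield (term, doc_id), 1


def shuffle(pairs):
    """Group values by key, standing in for the MapReduce shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sum(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def tf_idf(docs):
    """Compute TF-IDF weights (an assumed scheme) with three map/reduce-style passes."""
    # Pass 1: raw term counts per (term, document).
    pairs = [
        pair
        for doc_id, tokens in docs.items()
        for pair in map_term_counts(doc_id, tokens)
    ]
    tf = reduce_sum(shuffle(pairs))
    # Pass 2: document frequency per term, one count per distinct (term, doc) key.
    df = reduce_sum(shuffle((term, 1) for (term, _doc_id) in tf))
    # Pass 3: normalized term frequency times log inverse document frequency.
    n_docs = len(docs)
    return {
        (term, doc_id): (count / len(docs[doc_id])) * math.log(n_docs / df[term])
        for (term, doc_id), count in tf.items()
    }


# Toy corpus; real Chinese text would first need a word segmenter.
docs = {
    "d1": ["hadoop", "svm", "hadoop"],
    "d2": ["svm", "text"],
}
weights = tf_idf(docs)
```

In this toy corpus a term that occurs in every document, such as `svm`, receives weight zero, while terms confined to one document receive positive weight. On a real cluster each pass would be a separate job whose map and reduce functions run on different nodes.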
Keywords/Search Tags: Hadoop, Support Vector Machine, Big Data, Text Categorization