Research And Implementation Of Chinese Text Classification Based On Hadoop And SVM Algorithm

Posted on:2016-06-22

Degree:Master

Type:Thesis

Country:China

Candidate:L L Zhang

GTID:2208330470970573

Subject:Electronic and communication engineering

Abstract/Summary:

With the development of the Internet and the explosive growth of information, we have access to a wealth of information, but this abundance also brings difficulties: quickly and accurately extracting valuable information from massive amounts of data has become very important. Text classification is a foundation of data mining and an important guarantee for effectively and accurately mining valuable information from large volumes of text, so classifying large amounts of text quickly and accurately is a critical issue in the field of data mining.

First, this paper introduces the development and application of text categorization and Hadoop, studies in depth the two core components of Hadoop, the distributed file system and the distributed computing framework, and analyzes their working mechanisms. Second, it examines the process and key technologies of text classification, compares how Chinese text classification differs from text classification in other languages in several key technologies, selects the SVM algorithm as the object of study, and reviews and analyzes the underlying SVM theory. Then, combining Hadoop as the big data processing platform with text categorization theory, it implements text preprocessing, feature selection, weight calculation, and a parallel SVM algorithm in the MapReduce framework.

To study the performance of SVM-based Chinese text classification on the Hadoop platform, we built a small Hadoop cluster. On this platform, the paper analyzes training time and classification accuracy through a series of SVM Chinese text classification experiments. The experimental data show that training a traditional support vector machine is not only time-consuming but also consumes a lot of computing resources. If the data set is too large, it may cause errors or even crash the machine. This paper implements a parallel SVM algorithm on the Hadoop platform, which can reduce training time and improve classification accuracy, especially for large text collections.
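
As a rough illustration of the weight calculation step mentioned above, the following Python sketch simulates two MapReduce-style jobs in a single process. The abstract does not name the weighting scheme, so TF-IDF is an assumption here, chosen because it is a common choice in text classification. The function names (`map_term_counts`, `shuffle`, `reduce_sum`, `tfidf_weights`) are hypothetical, and the documents are assumed to be already segmented into words. It is not the thesis's implementation and does not use the Hadoop API.

```python
import math
from collections import defaultdict


def map_term_counts(doc_id, tokens):
    """Map phase (hypothetical): emit ((doc_id, term), 1) for every token."""
    for term in tokens:
        yield (doc_id, term), 1


def shuffle(pairs):
    """Group values by key, standing in for the MapReduce shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sum(key, values):
    """Reduce phase: sum all values emitted for one key."""
    return key, sum(values)


def tfidf_weights(docs):
    """Compute TF-IDF weights for pre-tokenized documents.

    docs maps a document id to its list of tokens. TF-IDF is an assumed
    weighting scheme; the source abstract does not specify one.
    """
    # Job 1: count occurrences of each term within each document.
    pairs = [kv for doc_id, tokens in docs.items()
             for kv in map_term_counts(doc_id, tokens)]
    counts = dict(reduce_sum(k, v) for k, v in shuffle(pairs).items())

    # Job 2: document frequency, one emission per distinct (doc, term) pair.
    df_pairs = [(term, 1) for (_doc_id, term) in counts]
    doc_freq = dict(reduce_sum(k, v) for k, v in shuffle(df_pairs).items())

    # Final step: combine term frequency and inverse document frequency.
    n_docs = len(docs)
    weights = {}
    for (doc_id, term), count in counts.items():
        tf = count / len(docs[doc_id])
        idf = math.log(n_docs / doc_freq[term])
        weights[(doc_id, term)] = tf * idf
    return weights
```

Calling `tfidf_weights({"d1": ["hadoop", "svm", "svm"], "d2": ["hadoop", "text"]})` gives a weight of zero to "hadoop", which appears in every document, and positive weights to "svm" and "text". On a real cluster each job would run as separate mapper and reducer tasks over data stored in the distributed file system.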

Keywords/Search Tags:

Hadoop, Support Vector Machine, Big Data, Text Categorization

Related items

1	Research On Text Classification Of Mixed-kernel Parallel Support Vector Machine Based On Hadoop
2	Application For Web Text Categorization Based On Support Vector Machine
3	The Application Research Of Support Vector Machine Theory In Text Categorization
4	Study On Text Categorization Method Based On Support Vector Machine
5	The Research On Text Categorization Algorithm Based On Support Vector Machine
6	Support Vector Machine Application In Text Categorization
7	The Research And Implementation Of Automatic Text Categorization For Chinese Web Documents
8	Research On Clustering And Text Categorization Based On Support Vector Machine
9	Research On Support Vector Machines Classification Algorithm In Text Categorization
10	A Study On Text Categorization Based On Machine Learning