The Research Of Text Classification Based On Hadoop

Posted on:2013-01-29

Degree:Master

Type:Thesis

Country:China

Candidate:C S Liu

GTID:2218330362959195

Subject:Control theory and control engineering

Abstract/Summary:

With the rapid development of the Internet, social networks, e-commerce, search engines, and mobile computing have become part of people's daily lives. As a result, data of all kinds is growing explosively, while the information people demand is becoming more refined and personalized. Classifying vast amounts of unstructured text data is therefore of great importance, and cloud computing provides a powerful tool for processing data at this scale. This thesis studies text classification on Hadoop, a parallel computing platform. The following work has been done: (1) Storage, computing, virtualization, and other key technologies of cloud computing were studied. As an open-source parallel computing platform, Hadoop has gradually become one of the most powerful tools for big data processing. The thesis examines the Hadoop Distributed File System (HDFS) and the MapReduce parallel programming paradigm in depth, covering their design, implementation, and other aspects. (2) The Hadoop platform was applied to text classification. A parallel text classification framework based on MapReduce was designed following the general text classification procedure. A small Hadoop cluster was built in a local virtual machine environment, and a parallel text classification algorithm was implemented in Java. The experimental results show the validity of the framework. (3) With the parallel framework in place, classification algorithms based on Neighborhood Component Analysis (NCA) were investigated. NCA was treated not as a metric learning algorithm but as a classification algorithm in its own right. Combining it with the idea of local neighbors, a classification algorithm called K-Neighborhood Component Analysis (K-NCA) was proposed. Simulation experiments on text classification achieved good results. Finally, the feasibility of parallelizing these algorithms was analyzed, and parallelization strategies using MapReduce were proposed.
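
The abstract does not say which statistics the MapReduce framework computes, so the following is only a minimal sketch of the general pattern, not the thesis's implementation. It assumes that a training step needs per-class term counts (the kind of statistic a classifier such as Naive Bayes consumes) and emulates the map, shuffle, and reduce phases in a single Python process rather than through the Hadoop API. The function names and the toy corpus are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_document(label, text):
    """Map step: emit ((label, term), 1) for every whitespace-separated term."""
    for term in text.lower().split():
        yield (label, term), 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the partial counts for one (label, term) key."""
    return key, sum(values)


def term_counts_by_class(corpus):
    """Run map, shuffle, and reduce in memory over (label, text) records."""
    mapped = chain.from_iterable(
        map_document(label, text) for label, text in corpus
    )
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


# Hypothetical toy corpus, invented for illustration only.
corpus = [
    ("sports", "team wins match"),
    ("sports", "team loses match"),
    ("tech", "hadoop cluster runs mapreduce"),
]
counts = term_counts_by_class(corpus)
print(counts[("sports", "team")])  # prints 2
```

On a real cluster the map calls would run in parallel over HDFS blocks and the framework would perform the grouping; because each map call depends only on its own document and each reduce call only on its own key, this kind of counting is a natural fit for that model.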

Keywords/Search Tags:

Text Classification, Hadoop, MapReduce, Cloud Computing, Neighborhood Component Analysis, K-Neighborhood Component Analysis

Related items

1	The Research Of Mapreduce Implementing Of Text Classification Algorithm Based On Mass Data
2	Research On Classification Learning Based On Rough Sets
3	The Mapreduce Model In The Hadoop Implementation Of Performance Analysis And Optimization Improvements
4	The Cloud Computing Based On Hadoop Platform And Log Analysis
5	Face Recognition Using Supervised Independent Component Analysis
6	The Research And Realization Of The Military Port Objects Classification Platform
7	Research On Neighborhood-based Efficient Classification Algorithm And Its Applications
8	Research On Decision Tree Classification Algorithm Based On Hadoop
9	Research On Classification Algorithm Used HADOOP
10	Design And Implementation Of Hadoop Cluster Web Log Analysis System Based On Eucalyptus