The Research Of Text Classification Based On Hadoop

Posted on: 2013-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: C S Liu
GTID: 2218330362959195
Subject: Control theory and control engineering
Abstract/Summary:
As the Internet develops rapidly, social networks, e-commerce, search engines and mobile computing have become deeply embedded in people's daily lives. As a result, all kinds of data show explosive growth, while the information people demand is increasingly refined and personalized. Classifying the vast amounts of unstructured text data is therefore of great importance, and cloud computing provides a powerful tool for processing data at this scale. This thesis studies text classification on Hadoop, a parallel computing platform. The following work has been done:

(1) Studied storage, computing, virtualization and other key technologies of cloud computing. As an open-source parallel computing platform, Hadoop has gradually become one of the most powerful big data processing tools. The thesis examines in depth the Hadoop Distributed File System (HDFS) and the MapReduce parallel programming paradigm, covering their design, implementation and other aspects.

(2) Applied the Hadoop platform to text classification. A parallel text classification framework based on MapReduce was designed following the general text classification procedure. A small Hadoop cluster was built in a local virtual machine environment, and a parallel text classification algorithm was implemented in Java. The experimental results show the validity of the framework.

(3) After completing the parallel text classification framework, a classification algorithm based on Neighborhood Component Analysis (NCA) was studied. NCA was treated not as a metric learning algorithm but as a classification algorithm. Combined with the idea of local neighbors, a classification algorithm called K-Neighborhood Component Analysis (K-NCA) was proposed. Simulation experiments in text classification achieved good results. Finally, the possibility of parallelizing these algorithms was analyzed, and parallelization strategies using MapReduce were proposed.
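To make the idea of using NCA directly as a classifier more concrete, the following is a minimal sketch rather than the thesis's actual K-NCA algorithm, whose details are not given in this abstract. It assumes the standard NCA soft-neighbor weighting (a softmax over negative squared distances in a linearly projected space) and assumes that "local neighbors" means restricting that softmax to the K nearest training points. The function name `knca_predict`, its parameters, and the toy data are hypothetical, and the sketch is written in plain Python for brevity even though the thesis implementation was in Java.

```python
import math
from collections import defaultdict


def knca_predict(query, train_x, train_y, k=3, transform=None):
    """Classify `query` with NCA-style soft-neighbor weights over its k nearest points.

    Assumption: this is an illustrative reading of "K-NCA", not the thesis's method.
    `transform` is an optional projection matrix (list of rows); None means identity.
    Returns (predicted_label, {label: probability}).
    """
    if not train_x:
        raise ValueError("training set must not be empty")

    def project(x):
        # Apply the linear map A x; with no matrix given, use the input as-is.
        if transform is None:
            return list(x)
        return [sum(a * v for a, v in zip(row, x)) for row in transform]

    q = project(query)

    # Squared Euclidean distance from the query to every training point.
    dists = []
    for x, y in zip(train_x, train_y):
        p = project(x)
        dists.append((sum((a - b) ** 2 for a, b in zip(q, p)), y))

    # Keep only the k closest points (the assumed "local neighbors" step).
    dists.sort(key=lambda t: t[0])
    neighbours = dists[:k]

    # Softmax over negative squared distances; shifting by the minimum
    # distance avoids underflow without changing the normalized weights.
    d_min = neighbours[0][0]
    weights = [(math.exp(-(d2 - d_min)), y) for d2, y in neighbours]
    total = sum(w for w, _ in weights)

    # Class probability = summed normalized weight of neighbors in that class.
    probs = defaultdict(float)
    for w, y in weights:
        probs[y] += w / total

    label = max(probs, key=probs.get)
    return label, dict(probs)


if __name__ == "__main__":
    # Toy 2-D "document vectors" with made-up labels.
    xs = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
    ys = ["sports", "sports", "finance", "finance"]
    print(knca_predict([0.2, 0.1], xs, ys, k=3))
```

Because each query only needs distances to the training points, the distance computation could plausibly be split across map tasks with a reduce step selecting the K nearest, though the parallel strategy actually proposed in the thesis is not described here.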
Keywords/Search Tags: Text Classification, Hadoop, MapReduce, Cloud Computing, Neighborhood Component Analysis, K-Neighborhood Component Analysis