Application Research Of Text Classification Based On Hadoop Platform

Posted on: 2016-07-12
Degree: Master
Type: Thesis
Country: China
Candidate: B B Lin
GTID: 2308330470973734
Subject: Software engineering
Abstract/Summary:
With the rapid growth of online documents, text classification has become a key technology for classifying and organizing text data. Studying how to classify massive amounts of unstructured text is therefore of great significance, and cloud computing provides a powerful tool for processing data at this scale. For this reason, this paper studies text classification on Hadoop, an open-source distributed platform. The following work has been done:
1. Studied the principles and system architecture of Hadoop, including the principles and operating mechanisms of its two core components (HDFS and MapReduce), the new generation of MapReduce (YARN), and the installation and configuration of Hadoop.
2. Studied the theory and key technologies of text classification: the implementation process of text categorization and the key techniques involved in each step, including text preprocessing, the vector space model, feature weight calculation, and reduction of the text feature dimension. Studied the Naive Bayes and KNN algorithms, implemented classifiers based on them, and ran text classification experiments to analyze the differences between the classifiers.
3. Restructured the text classification steps under the MapReduce model, and designed and implemented parallel versions of the Bayes classifier and the KNN classifier.
4. Studied dimensionality reduction methods such as PCA and the mapping of feature words to concepts. Implemented KNN classifiers with dimensionality reduction based on PCA and on HowNet, and ran experiments to analyze the differences between the two methods. Then designed and implemented a parallel version of the KNN classifier with HowNet-based dimensionality reduction using the MapReduce model. Finally, ran text classification on Hadoop clusters to observe how distributed execution affects classifier efficiency and to verify the correctness of the MapReduce implementation of the classifier.
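Item 3 of the abstract describes restructuring the classification steps under the MapReduce model. As a rough illustration only, and not the thesis's actual implementation, the sketch below shows how Naive Bayes training could be expressed as a map step that emits counts and a reduce step that sums them. It runs in a single Python process rather than on Hadoop. The function names (`map_doc`, `reduce_counts`, `train`, `classify`), the whitespace tokenizer, and the add-one (Laplace) smoothing are all assumptions made for this example.

```python
import math
from itertools import groupby


def map_doc(label, text):
    """Map step: emit a count of 1 for the document's class and for each (class, word)."""
    yield ("doc", label), 1
    for word in text.lower().split():  # assumed tokenizer: whitespace split
        yield ("word", label, word), 1


def reduce_counts(pairs):
    """Shuffle and reduce step: group pairs by key, then sum the values per key."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    }


def train(corpus):
    """Run the map step over every (label, text) pair and reduce the emitted counts."""
    pairs = [kv for label, text in corpus for kv in map_doc(label, text)]
    return reduce_counts(pairs)


def classify(counts, text):
    """Pick the class with the highest log posterior, using add-one smoothing (assumed)."""
    doc_counts = {k[1]: v for k, v in counts.items() if k[0] == "doc"}
    vocab = {k[2] for k in counts if k[0] == "word"}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in doc_counts.items():
        total_words = sum(
            v for k, v in counts.items() if k[0] == "word" and k[1] == label
        )
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            word_count = counts.get(("word", label, word), 0)
            score += math.log((word_count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Calling `train` on a list of `(label, text)` pairs returns the reduced count table, which `classify` then uses to label new text. On a real cluster the map and reduce functions would run as separate distributed tasks, with the framework handling the sort-and-group step that `reduce_counts` imitates here.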
Keywords/Search Tags: Text Classification, Hadoop, MapReduce, KNN, Bayes, HowNet, dimensionality reduction