
Study And Implementation Of The Text Categorization Of Electricity Goods Based On Hadoop

Posted on: 2015-04-21
Degree: Master
Type: Thesis
Country: China
Candidate: W Jiang
Full Text: PDF
GTID: 2298330452450126
Subject: Communication and Information System
Abstract/Summary:
Classification is an important data mining task that is widely used in real life. The technology is now mature, but with the development of society and the explosion of data, classification algorithms face new challenges. Current research on classification focuses on improving accuracy; few studies address how to improve classification speed. With big data, it is easy to obtain a massive training set to raise accuracy, but classification must also be fast. Improving classification speed on massive data is therefore significant. Based on the Hadoop platform, I designed and implemented a Bayesian text classifier with large-scale electricity goods as the training set.

This paper first introduces the research background and significance. It then introduces the preprocessing methods for document information, including segmentation and stop-word processing; segmentation covers mainstream Chinese segmentation techniques and English segmentation. Next it introduces the vector space model for documents, the feature selection algorithms that reduce the dimensionality of the model, and the feature weights that distinguish each feature's contribution to classification, followed by the evaluation standards for text classifiers and naive Bayes (NB) theory. Finally, it introduces the Hadoop platform, including HDFS and Map/Reduce.

I studied the characteristics of mechanical segmentation and found that most current segmenters handle only Chinese or handle English poorly. I designed and implemented a tokenizer for documents that mix Chinese and English, based on mechanical word segmentation with simple statistical ambiguity handling; it gives good segmentation results and fast segmentation speed. The tokenizer implements Lucene's Analyzer segmentation interface and can be used together with Lucene. A Lucene index built from the segmentation results accelerates the word frequency statistics for the relevant features. The NB implementation uses smoothing and a fast search to improve the accuracy and speed of the NB classifier. To further improve classification speed, I combined the classifier with a fast search algorithm based on the WAND algorithm and proposed a fast Bayesian algorithm. Trained in the Hadoop distributed environment with large-scale electricity goods as the training set, the classifier achieves good accuracy, recall, and F1 value and classifies quickly, so it has practical value.

Finally, the paper describes the implementation process of the classifier on the Hadoop platform and then implements a book recommendation system based on the classifier, using a B/S architecture, a MySQL database, and Java web technology.
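
As an illustration of what smoothing contributes to an NB text classifier, the following is a minimal single-machine sketch in Python. It is not the thesis implementation: it uses no Hadoop, Lucene, or WAND, it assumes add-one (Laplace) smoothing because the abstract does not say which smoothing method was used, and the class name `NaiveBayes` and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing (assumed)."""

    def __init__(self):
        self.doc_counts = Counter()              # number of documents per class
        self.term_counts = defaultdict(Counter)  # term frequencies per class
        self.vocab = set()                       # all terms seen in training

    def train(self, tokens, label):
        self.doc_counts[label] += 1
        self.term_counts[label].update(tokens)
        self.vocab.update(tokens)

    def log_score(self, tokens, label):
        total_docs = sum(self.doc_counts.values())
        total_terms = sum(self.term_counts[label].values())
        # Log prior of the class.
        score = math.log(self.doc_counts[label] / total_docs)
        for t in tokens:
            # Add-one smoothing: a term unseen in this class still gets a
            # small non-zero probability instead of zeroing the product.
            count = self.term_counts[label][t]
            score += math.log((count + 1) / (total_terms + len(self.vocab)))
        return score

    def classify(self, tokens):
        # Pick the class with the highest log posterior score.
        return max(self.doc_counts, key=lambda c: self.log_score(tokens, c))


# Hypothetical, already-tokenized product titles.
nb = NaiveBayes()
nb.train(["usb", "cable", "charger"], "electronics")
nb.train(["phone", "charger", "case"], "electronics")
nb.train(["novel", "paperback", "fiction"], "books")
print(nb.classify(["charger", "cable", "adapter"]))  # prints: electronics
```

The term "adapter" never appears in training, yet the document is still scored for both classes; without smoothing its zero count would make every class probability zero. Scores are summed in log space to avoid floating-point underflow on long documents.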
Keywords/Search Tags: Text Classification, Hadoop, Lucene, WAND, NB, Ambiguity