Study And Implementation Of The Text Categorization Of Electricity Goods Based On Hadoop

Posted on:2015-04-21

Degree:Master

Type:Thesis

Country:China

Candidate:W Jiang

Full Text:PDF

GTID:2298330452450126

Subject:Communication and Information System

Abstract/Summary:

Classification is an important data mining task that is widely used in practice. The technology is mature, but with the development of society and the explosive growth of data, classification algorithms face new challenges. Current research on classification focuses mainly on improving accuracy; few studies address how to improve classification speed. With big data it is easy to obtain a massive training set and thereby raise accuracy, but classification must then also be fast, so improving classification speed on massive data is significant. On the Hadoop platform, I designed and implemented a Bayesian text classifier with large-scale electricity goods as the training set.

This thesis first introduces the research background and significance. It then introduces the preprocessing techniques for document information, including word segmentation and stop-word processing; segmentation covers the mainstream Chinese segmentation techniques and English segmentation. Next it introduces the vector space model for documents, the feature selection algorithms used to reduce the dimensionality of the model, and the feature weights that distinguish each feature's contribution to classification, followed by the evaluation criteria for text classifiers and the theory of naive Bayes (NB). Finally, it introduces the Hadoop platform, including HDFS and Map/Reduce.

The thesis studies the characteristics of mechanical segmentation and finds that most current segmenters handle only Chinese or handle English poorly. I therefore designed and implemented a tokenizer adapted to mixed Chinese-English documents. It is based on mechanical segmentation with simple statistics-based ambiguity handling, and it gives good segmentation results at high speed. The tokenizer is implemented against the Lucene Analyzer interface and can be used together with Lucene. A Lucene index built from the segmentation results accelerates the word-frequency statistics needed for the features. The NB implementation uses smoothing and a fast search to improve the accuracy and speed of the NB classifier. To speed up classification further, a fast search algorithm based on the WAND algorithm is combined with NB to give a fast Bayesian algorithm. Run in a Hadoop distributed environment with large-scale electricity goods as the training set, the classifier achieves good accuracy, recall, and F1 value and classifies quickly, so it has some practical value.

Finally, the thesis describes the implementation of the classifier on the Hadoop platform and then implements a book recommendation system based on the classifier, using a B/S architecture, a MySQL database, and Java web technology.
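
As a rough illustration of the smoothed NB step mentioned in the abstract, the Python sketch below trains a multinomial naive Bayes classifier with add-alpha (Laplace) smoothing on already-segmented documents. It is not the thesis's implementation, which runs on Hadoop and Lucene and adds a WAND-based fast search; the class name `NaiveBayesTextClassifier`, the `alpha` parameter, and the toy data are assumptions made only for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-alpha (Laplace) smoothing.

    Hypothetical illustration class; documents are lists of tokens
    that have already been segmented.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant (assumed value)
        self.class_doc_counts = Counter()        # documents per class
        self.term_counts = defaultdict(Counter)  # term frequency per class
        self.total_terms = Counter()             # total tokens per class
        self.vocabulary = set()                  # all terms seen in training

    def fit(self, documents, labels):
        for tokens, label in zip(documents, labels):
            self.class_doc_counts[label] += 1
            self.term_counts[label].update(tokens)
            self.total_terms[label] += len(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, tokens):
        """Return log P(class) + sum of log P(term | class) for each class."""
        n_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            score = math.log(doc_count / n_docs)
            denom = self.total_terms[label] + self.alpha * vocab_size
            for term in tokens:
                if term not in self.vocabulary:
                    continue  # terms never seen in training are skipped
                # Smoothing keeps the probability non-zero when a term
                # never occurred in this class.
                count = self.term_counts[label][term]
                score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training set (invented for illustration only).
train_docs = [
    ["java", "programming", "book"],
    ["hadoop", "programming", "guide"],
    ["wireless", "phone", "charger"],
    ["phone", "case", "cover"],
]
train_labels = ["books", "books", "electronics", "electronics"]

clf = NaiveBayesTextClassifier(alpha=1.0).fit(train_docs, train_labels)
predicted = clf.predict(["programming", "book"])
```

Working in log space avoids numeric underflow when many small probabilities are multiplied. In a Map/Reduce setting the per-class term counts gathered in `fit` are the kind of aggregate that mappers could emit and reducers could sum, although the abstract does not spell out how the thesis partitions this work.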

Keywords/Search Tags:

Text Classification, Hadoop, Lucene, WAND, NB, Ambiguity

Related items

1	Research On Classification Of Web User Access Preferences Based On Hadoop
2	Application Research Of Text Classification Based On Hadoop Platform
3	Research Of LDA Short Text Classification Algorithm Based On Hadoop Platform
4	Design And Implementation Of University Digital Library System Based On Hadoop
5	Research And Implementation Of Automatic Text Classification Based On Hadoop
6	Research And Implementation Of Chinese Text Classification Based On Hadoop And SVM Algorithm
7	Research On Text Classification Method Based On Hadoop
8	The Design And Implementation Of A CBIR System Based On Hadoop And Lucene
9	Design And Implementation Of Wireless City Full-Text Retrieval System Based On Lucene
10	Design And Implementation Of Data Mining Classification System Based On Hadoop