Research Of Distributed Text Categorization Based On Hadoop

Posted on: 2014-03-28 | Degree: Master | Type: Thesis
Country: China | Candidate: Y S Jia | Full Text: PDF
GTID: 2268330392973669 | Subject: Computer Science and Technology
Abstract/Summary:
The development of information technology has caused the amount of information to explode, and a large share of that information is stored as text. As a key technology for organizing and processing large-scale text data, text categorization is widely used in spam filtering, public opinion monitoring, digital libraries, and many other fields.
Before large-scale text data can be classified, the preprocessing, feature selection, vectorization, and other stages require a great deal of computation, which is time-consuming and memory-intensive; training a BP network on the resulting text vectors is also time-consuming. To solve these problems, this paper uses the Hadoop open-source distributed computing platform and designs distributed, parallel methods for the various phases of text categorization with the MapReduce parallel programming model, in order to improve the efficiency of categorizing texts.
First, we analyze the key technologies in text categorization and study the HDFS distributed file system and the MapReduce parallel programming model of Hadoop. Then, we use the MapReduce model to decompose Chinese word segmentation and word frequency statistics in text preprocessing, as well as the feature selection method and the TFIDF feature weight calculation, into Map and Reduce tasks so that they can be computed in a distributed, parallel fashion. Based on research into BP network training methods and parallel training strategies, this paper designs a BP network text categorization model based on data-parallel batch training. We partition the text data into blocks distributed across the nodes, and achieve distributed, parallel training of the model by computing in parallel on every node and adjusting the network weights in batches. Training of the BP network finishes after many iterations. Every node then classifies texts in parallel with the trained BP network text categorization model, which improves the efficiency of text categorization.
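The decomposition of word frequency statistics and TFIDF weighting into Map and Reduce tasks might look roughly like the sketch below. It is a single-process Python simulation, not the Hadoop API and not the thesis's own code: the function names are hypothetical, whitespace splitting stands in for a real Chinese word segmenter, and the weight `tf * log(N / df)` with raw counts as `tf` is one common TFIDF variant assumed here for illustration.

```python
import math
from collections import defaultdict


def map_term_counts(doc_id, text):
    """Map: emit ((term, doc_id), 1) for every token in one document."""
    for term in text.split():  # assumption: stand-in for a Chinese word segmenter
        yield (term, doc_id), 1


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sum(key, values):
    """Reduce: sum the counts emitted for one key."""
    return key, sum(values)


def tfidf(docs):
    """Return {(term, doc_id): weight} using tf * log(N / df) (assumed variant)."""
    # Job 1: word frequency per (term, document).
    mapped = [kv for doc_id, text in docs.items()
              for kv in map_term_counts(doc_id, text)]
    tf = dict(reduce_sum(k, v) for k, v in shuffle(mapped).items())

    # Job 2: document frequency per term (one count per distinct document).
    df_pairs = [(term, 1) for (term, _doc_id) in tf]
    df = dict(reduce_sum(k, v) for k, v in shuffle(df_pairs).items())

    # Final weighting: each (term, doc) pair can be weighted independently.
    n_docs = len(docs)
    return {(term, doc_id): count * math.log(n_docs / df[term])
            for (term, doc_id), count in tf.items()}


# Toy corpus, made up for illustration only.
docs = {
    "d1": "hadoop mapreduce hadoop",
    "d2": "mapreduce text",
}
weights = tfidf(docs)
```

Because each Map call touches only one document and each Reduce call only one key, the same functions could in principle run on separate nodes; a term that occurs in every document (here `mapreduce`) receives weight zero under this variant.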
Finally, through experiments, we verify that the Hadoop-based distributed text categorization method proposed in this paper can improve the speed of text categorization.
The Hadoop-based distributed text categorization method designed in this paper uses the MapReduce parallel framework to improve the efficiency of preprocessing, feature selection, and text vectorization in text categorization. It also improves the efficiency of training the BP network used for text categorization, and it can classify large numbers of texts in a distributed, parallel fashion.
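The data-parallel batch training idea described in the abstract can be sketched as follows. This is a minimal single-process simulation under stated assumptions, not the thesis's implementation: the network shape (one sigmoid hidden layer, one sigmoid output), the squared-error loss, the learning rate, the function names, and the toy data are all chosen here for illustration. Each "node" computes backpropagation gradients over its own data block (the Map step), and the summed gradients are applied as one weight adjustment per iteration (the Reduce step).

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_in, n_hidden, seed=0):
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(p, x):
    """Return hidden activations and the scalar output for one input vector."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    y = sigmoid(sum(w * hi for w, hi in zip(p["W2"], h)) + p["b2"])
    return h, y


def map_gradients(p, block):
    """Map: one node sums backprop gradients over its local data block."""
    g = {"W1": [[0.0] * len(row) for row in p["W1"]],
         "b1": [0.0] * len(p["b1"]),
         "W2": [0.0] * len(p["W2"]),
         "b2": 0.0}
    for x, t in block:
        h, y = forward(p, x)
        d_out = (y - t) * y * (1.0 - y)  # output-layer delta
        g["b2"] += d_out
        for j, hj in enumerate(h):
            g["W2"][j] += d_out * hj
            d_hid = d_out * p["W2"][j] * hj * (1.0 - hj)  # hidden-layer delta
            g["b1"][j] += d_hid
            for i, xi in enumerate(x):
                g["W1"][j][i] += d_hid * xi
    return g


def reduce_and_update(p, grads, lr):
    """Reduce: apply every node's gradient, i.e. one batch step on their sum."""
    for g in grads:
        p["b2"] -= lr * g["b2"]
        for j in range(len(p["W2"])):
            p["W2"][j] -= lr * g["W2"][j]
            p["b1"][j] -= lr * g["b1"][j]
            for i in range(len(p["W1"][j])):
                p["W1"][j][i] -= lr * g["W1"][j][i]


def total_error(p, data):
    return sum(0.5 * (forward(p, x)[1] - t) ** 2 for x, t in data)


# Toy data (logical OR), made up for illustration and split across two "nodes".
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
blocks = [data[:2], data[2:]]

params = init_params(n_in=2, n_hidden=3)
error_before = total_error(params, data)
for _ in range(2000):
    grads = [map_gradients(params, block) for block in blocks]  # parallelizable
    reduce_and_update(params, grads, lr=0.5)
error_after = total_error(params, data)
```

Because all per-node gradients are computed from the same weights before any update is applied, summing them gives the same batch gradient as a single pass over the full data set, which is what makes this form of training straightforward to distribute.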
Keywords/Search Tags: Text Categorization, Hadoop, Distributed, Parallel, BP network