Research And Implementation Of Automatic Text Classification Based On Hadoop

Posted on:2014-04-07

Degree:Master

Type:Thesis

Country:China

Candidate:Y Y Zhang

Full Text:PDF

GTID:2268330392469625

Subject:Electronics and Communications Engineering

Abstract/Summary:

With the spread of the Internet and the rapid development of Internet technology, the volume of data online is growing exponentially. Most of this data is semi-structured and structured data similar to Web text, so helping users find and locate the information they need has become an urgent problem. To improve the efficiency and accuracy of user searches, it is important to classify semi-structured and structured Web text effectively. Classification techniques have reached a certain level of maturity, but because the development of computer hardware cannot keep up with the growth of user demand, current hardware speeds cannot satisfy the requirements of massive data processing and quick response. The purpose of this thesis is to design and implement an automatic text categorization system for massive data.

Motivated by the exponential growth of data, the urgent need for classification, the rise of cloud computing, and the development of classification technology, this thesis proposes the design and implementation of an automatic text categorization system based on a Hadoop cluster. First, the Hadoop framework, including MapReduce and the Hadoop Distributed File System (HDFS), is analyzed, and a Hadoop cluster is built. Next, mature classification techniques and algorithms are studied, together with text preprocessing and vectorization, and a text classification model is constructed. Finally, text preprocessing, including tokenization, stemming, stop-word removal, and so forth, is carried out in the Linux shell, while feature word selection, text vectorization, training, and testing are implemented following the MapReduce programming model. The resulting system lowers hardware requirements, reduces cost, meets the demands of highly concurrent processing of massive data, and improves the speed and efficiency of data processing.
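The abstract does not say which feature word selection criterion the thesis uses, so the following is only a minimal sketch of how such a step could be phrased in the MapReduce model. It assumes a simple document-frequency threshold, simulates the map, shuffle/sort, and reduce phases in a single Python process rather than on a Hadoop cluster, and omits stemming. The stop-word list, the sample documents, and all function names (`map_doc`, `reduce_term`, `run_job`, `select_features`) are hypothetical and chosen for illustration.

```python
from itertools import groupby

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase the text and split on whitespace, dropping stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def map_doc(doc_id, text):
    """Map phase: emit (term, 1) once per distinct term in a document."""
    for term in set(tokenize(text)):
        yield term, 1


def reduce_term(term, counts):
    """Reduce phase: sum the per-document counts into a document frequency."""
    return term, sum(counts)


def run_job(docs):
    """Simulate map, shuffle/sort, and reduce in a single process."""
    pairs = [kv for doc_id, text in docs.items() for kv in map_doc(doc_id, text)]
    pairs.sort(key=lambda kv: kv[0])  # shuffle/sort: bring equal keys together
    return dict(
        reduce_term(term, (count for _, count in group))
        for term, group in groupby(pairs, key=lambda kv: kv[0])
    )


def select_features(doc_freq, min_df):
    """Keep terms that appear in at least min_df documents (assumed criterion)."""
    return sorted(t for t, df in doc_freq.items() if df >= min_df)


# Hypothetical sample corpus.
docs = {
    "d1": "Hadoop stores the data in HDFS",
    "d2": "MapReduce jobs read data from HDFS",
    "d3": "The classifier is trained on text data",
}
doc_freq = run_job(docs)
features = select_features(doc_freq, min_df=2)
print(features)
```

On a real cluster the map and reduce functions would run as separate tasks over HDFS input splits; the single-process version here only illustrates the key/value flow between the two phases.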

Keywords/Search Tags:

text categorization, Hadoop, MapReduce, HDFS, feature word selection

Related items

1	An Implementation Of Text Categorization System Based On Hadoop
2	On Bayesian Text Classification Learning Under MapReduce Framework
3	Design And Implementation Of Text Classification System Based On Hadoop Platform
4	Research On Big Data Text Analysis Based On Hadoop Architecture
5	The Performance Optimization And Improvement Of MapReduce In Hadoop
6	Research On Text Classification Method Based On Hadoop
7	Research On Distributed Processing Of Massive Video Data Based On Hadoop
8	Research Of Text Feature Selection Algorithm Based On Hadoop
9	Working Principle And Applied Research Of MapReduce
10	MapReduce Performance Research And Optimization Based On Block Aggregation