An Implementation Of Text Categorization System Based On Hadoop

Posted on: 2014-03-14   Degree: Master   Type: Thesis
Country: China   Candidate: Z P Pan   Full Text: PDF
GTID: 2298330422990700   Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rapid development of Internet technology and the growing popularity of the Internet, the volume of data online is increasing exponentially. Most of this mass data is semi-structured or structured data similar to Web text. Finding and locating the information that users actually need has therefore become an urgent problem. To improve the efficiency and accuracy of user searches, it is very important to classify semi-structured and structured Web text data effectively. Classification techniques have reached a certain level of maturity, but the development of computer hardware cannot keep up with growing user demand, so current hardware speeds cannot satisfy the requirements of massive data processing and quick response. The purpose of this thesis is to design and implement a text categorization system for massive data.

Motivated by the exponential growth of data, the urgent need for classification, the rise of cloud computing, and the development of classification technology, this thesis proposes a design and an implementation of text categorization based on a Hadoop cluster. First, the framework of the Hadoop system is analyzed and a Hadoop cluster is built. Then, current mature classification techniques and algorithms, together with text preprocessing and vectorization, are studied, and a text classification model is constructed. Finally, feature word selection, text vectorization, training, and testing are implemented according to the MapReduce programming model. The resulting text categorization system lowers hardware requirements, saves cost, meets the demands of highly concurrent processing of mass data, and improves the speed and efficiency of data processing.
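The abstract does not give the thesis's actual MapReduce jobs, so the following is only a minimal sketch of how one step, counting in how many documents of each class a term appears (a common input to feature word selection), might be phrased as a map step and a reduce step. It runs in a single Python process rather than on Hadoop, and all function names (`map_doc`, `shuffle`, `reduce_counts`, `document_frequencies`) as well as the whitespace tokenization are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Key = Tuple[str, str]  # (term, class label); hypothetical key layout


def map_doc(label: str, text: str) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((term, label), 1) once per distinct term in a document."""
    for term in set(text.lower().split()):  # naive whitespace tokenization (assumption)
        yield (term, label), 1


def shuffle(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, List[int]]:
    """Group values by key, standing in for the framework's shuffle phase."""
    groups: Dict[Key, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: Key, values: List[int]) -> Tuple[Key, int]:
    """Reduce step: sum the 1s to get the per-class document frequency."""
    return key, sum(values)


def document_frequencies(docs: Iterable[Tuple[str, str]]) -> Dict[Key, int]:
    """Run map, shuffle, and reduce in-process over (label, text) pairs."""
    mapped = (pair for label, text in docs for pair in map_doc(label, text))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
```

On a real cluster the map and reduce functions would run as separate tasks over HDFS blocks, and the grouping done here by `shuffle` would be handled by the framework. The resulting counts could then feed whichever feature selection measure the system uses.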
Keywords/Search Tags: Hadoop, text categorization, MapReduce, HDFS