| With the spread of the Internet and the rapid development of Internet technology, the volume of data on the Internet is growing exponentially. Most of this massive data is semi-structured or structured data similar to Web text. Finding and locating the information that users actually need on the Internet has therefore become an urgent problem. To improve the efficiency and accuracy of user searches, it is very important to classify semi-structured and structured Web text data effectively. Classification techniques have reached a certain level of maturity, but because the development of computer hardware cannot keep pace with user demand, current hardware cannot satisfy the requirements of massive data processing and quick response. The purpose of this paper is to design and implement an automatic text categorization system for massive data. Motivated by the exponential growth of massive data, the urgent need for classification, the rise of cloud computing, and the development of classification technology, this paper puts forward a design and implementation of automatic text categorization based on a Hadoop cluster. First, the framework of the Hadoop system, including MapReduce and the Hadoop Distributed File System (HDFS), is analyzed, and a Hadoop cluster is built. Then, current mature classification techniques and algorithms, together with text preprocessing and vectorization, are studied, and a text classification model is constructed. Finally, text preprocessing, including tokenization, stemming, removal of stop words, and so forth, is handled with the Linux shell, while feature word selection, text vectorization, training, and testing are implemented according to the MapReduce programming model. 
This automatic text categorization system reduces hardware requirements, saves cost, satisfies the requirements of highly concurrent processing of massive data, and improves the speed and efficiency of data processing. |
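The preprocessing and feature-counting stages named in the abstract can be illustrated with a minimal sketch. It is not the system described in the paper: it simulates the map and reduce phases in memory with plain Python rather than on a Hadoop cluster, and it does the preprocessing in Python rather than in the Linux shell. The stop-word list, the suffix-stripping stemmer, the function names, and the sample corpus are all hypothetical, chosen only to show how per-class document frequencies (a common input to feature word selection) could be computed in the MapReduce style.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def preprocess(text):
    """Tokenize, drop stop words, and apply a crude suffix-stripping stemmer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        # Illustrative stemming only: strip a trailing "ing" or "s".
        if tok.endswith("ing") and len(tok) > 5:
            tok = tok[:-3]
        elif tok.endswith("s") and len(tok) > 3:
            tok = tok[:-1]
        stems.append(tok)
    return stems


def map_doc(label, text):
    """Map step: emit ((label, term), 1) once per distinct term in a document."""
    for term in set(preprocess(text)):
        yield (label, term), 1


def reduce_counts(pairs):
    """Reduce step: sum the values per key (the shuffle is simulated by a dict)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def document_frequencies(corpus):
    """Run the simulated map and reduce phases over (label, text) records."""
    pairs = []
    for label, text in corpus:
        pairs.extend(map_doc(label, text))
    return reduce_counts(pairs)


# Hypothetical labeled documents, for illustration only.
corpus = [
    ("sports", "The team is winning games"),
    ("sports", "Games and teams"),
    ("tech", "Clusters of computing nodes"),
]
df = document_frequencies(corpus)
```

Here `df` maps each `(label, term)` pair to the number of documents of that class containing the term. On a real cluster the same map and reduce functions would run as separate distributed tasks, and the resulting counts could feed a feature selection score and the later vectorization step.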