The Key Technologies Research Of Web Text Mining Based On Hadoop

Posted on: 2013-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: L L Chen
GTID: 2248330371986193
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, the production and storage of data have reached an unprecedented scale. At the same time, extracting valuable and potentially useful information from huge volumes of data is a major challenge for traditional data mining technology, and data mining methods based on cloud computing have emerged in response.

Hadoop is an open source cloud computing platform whose core technologies are the Hadoop Distributed File System (HDFS) and the MapReduce programming model. On this platform, files are stored in HDFS, and the MapReduce programming framework is used to realize parallel computing. Because Hadoop makes it convenient and fast to build a computer cluster and handle large data sets, it is worthwhile to port traditional data mining methods to the Hadoop platform, and the key to doing so is the parallel implementation of traditional data mining algorithms.

So far, data mining research based on Hadoop has produced results in some fields, but further work is still needed. Based on the theory of cloud computing and data mining, this thesis mainly studies the text preprocessing algorithm and its improvement, the support vector machine (SVM) and its improvement, and the realization of a parallel SVM algorithm. The main contents and research results are as follows:

1. Research on the Hadoop distributed platform and Web mining theory. This part studies HDFS and the MapReduce programming framework in depth, and then introduces Web mining theory and algorithms in detail.
2. Preprocessing of Web text. The concrete steps and related algorithms of the Web text preprocessing process are studied. The traditional feature model does not fully consider the influence of feature vectors whose weights are small, so an improved model is proposed. In this model, the average of the feature vector is computed and the weights are then standardized, so that all feature items start from the same baseline in text classification. Finally, the superiority of the improved model is verified through experiment.
3. The improved SVM algorithm and its parallel implementation. This part first studies the SVM algorithm and then, in view of the problems of the existing algorithm, puts forward an improved SVM algorithm: changing the kernel function improves the learning and generalization ability of the SVM, and with it the classification results. At the same time, considering classification time, the parallel SVM algorithm is realized using several parallel strategies and ported to the Hadoop platform. Finally, the availability of the parallel SVM algorithm and the superiority of the improved SVM algorithm are verified experimentally.
4. A Hadoop cluster environment is built, the support vector machine classifier is implemented in Java, and the classifier is evaluated with several evaluation metrics.
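
To make the MapReduce programming model mentioned above more concrete, the following is a minimal sketch of term counting, a common first step in text preprocessing, written in the map/reduce style. It is plain Java with no Hadoop dependency and is not taken from the thesis: the class name TermCountSketch, the method names, and the sample documents are all hypothetical, and a real Hadoop job would use the framework's own Mapper and Reducer classes and read its input from HDFS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: emulates the map and reduce phases in memory.
// All names here are assumptions, not the thesis author's code.
public class TermCountSketch {

    // Map phase: emit a (term, 1) pair for every token in one document.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : document.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle and reduce phase: group the pairs by term and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Driver: on a cluster each document would be mapped on a separate
    // node in parallel; here the map calls simply run one after another.
    static Map<String, Integer> run(List<String> documents) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String document : documents) {
            all.addAll(map(document));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        List<String> documents = List.of(
                "Hadoop stores files in HDFS",
                "MapReduce runs on Hadoop");
        System.out.println(run(documents));
    }
}
```

Because each call to map depends only on its own document, the map phase can be spread across cluster nodes without coordination, which is the property that makes porting data mining algorithms to this model attractive.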
Keywords/Search Tags: Data Mining, Web Text Mining, SVM, Hadoop, Parallel Computing