The Key Technologies Research Of Web Text Mining Based On Hadoop

Posted on:2013-03-09

Degree:Master

Type:Thesis

Country:China

Candidate:L L Chen

Full Text:PDF

GTID:2248330371986193

Subject:Computer application technology

Abstract/Summary:

With the rapid development of information technology, the production and storage of data have reached an unprecedented scale. Extracting valuable and potentially useful information from such large volumes of data is a major challenge for traditional data mining technology, and data mining methods based on cloud computing have emerged in response.

Hadoop is an open source cloud computing platform whose core technologies are the Hadoop Distributed File System (HDFS) and the MapReduce programming model. On this platform, files are stored in HDFS and the MapReduce programming framework is used to carry out parallel computation. Because Hadoop makes it convenient and fast to build a computer cluster and process large data sets, porting traditional data mining methods to the Hadoop platform is worthwhile, and the key to doing so is parallelizing the traditional data mining algorithms.

Data mining research based on Hadoop has produced results in some fields, but further work is still needed. Based on the theory of cloud computing and data mining, this thesis mainly studies text preprocessing algorithms and their improvement, the support vector machine (SVM) and its improvement, and the implementation of a parallel SVM algorithm. The main contents and research results are as follows:

1. Research on the Hadoop distributed platform and Web mining theory. This part studies HDFS and the MapReduce programming framework in depth, and then introduces Web mining theory and algorithms in detail.

2. Preprocessing of Web text. The concrete steps and related algorithms of the preprocessing process are studied. The traditional feature model does not fully consider the influence of feature vectors whose weights are small, so an improved model is proposed. In this model, the average of the feature vector is computed and the features are then standardized, so that all feature items start from the same point in text classification. The superiority of the improved model is verified by experiment.

3. The improved SVM algorithm and its parallel implementation. This part first studies the SVM algorithm, and then proposes an improved SVM algorithm that addresses problems of the existing algorithm: changing the kernel function improves the learning and generalization ability of the SVM, and therefore the classification results. To reduce classification time, a parallel SVM algorithm is also implemented using several parallel strategies and ported to the Hadoop platform. Experiments verify the feasibility of the parallel SVM algorithm and the superiority of the improved SVM algorithm.

4. A Hadoop cluster environment is built, a support vector machine classifier is implemented in Java, and the classifier is evaluated using several evaluation metrics.
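As an illustration of the MapReduce programming model mentioned in the abstract, the following is a minimal sketch of term counting over a few short documents. It is not code from the thesis: it uses plain Java collections to imitate the map, shuffle, and reduce phases in a single process rather than the Hadoop API, and the class name `MapReduceSketch` and its method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of map -> shuffle -> reduce.
// A real Hadoop job would distribute these phases across a cluster.
public class MapReduceSketch {

    // Map phase: emit one (term, 1) pair for every token in a document.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : document.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key (the term).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for a single term.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run the three phases over all documents and return term counts.
    static Map<String, Integer> run(List<String> documents) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String doc : documents) {
            emitted.addAll(map(doc));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Hadoop stores files in HDFS",
                "MapReduce runs on Hadoop");
        System.out.println(run(docs));
    }
}
```

Because each call to `map` depends only on its own document and each call to `reduce` only on the values for one term, both phases can run in parallel on different machines, which is the property that makes porting data mining algorithms to Hadoop attractive.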

Keywords/Search Tags:

Data Mining, Web Text Mining, SVM, Hadoop, Parallel Computing

Related items

1	Parallel Data Mining Algorithms Research Of Hadoop
2	Research On Parallel Processing Technology Of Large-scale Text Mining Under Cloud Computing Environment
3	Research And Implementation Of Big Data Analysis And Mining Technology Based On Hadoop In Telecommunications Industry
4	The Research Of Data Mining Based On HADOOP
5	Research And Application Of Text Mining Based On Hadoop
6	The Research And Implement Of Data Mining Algorithms Based On Hadoop
7	Based On The Parallel Implementation Of Multi-node Data Mining Algorithm
8	The Research And Implement Of Data Mining Algorithms Based On Hadoop
9	Research On Text Mining Based On MapReduce
10	Research And Design Of Data Mining System For Tcm Disease Based On Cloud Computing Environment