Research On Classification Of Web User Access Preferences Based On Hadoop

Posted on: 2017-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: S F Jiang
Full Text: PDF
GTID: 2278330485450740
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the amount of information online is growing exponentially, and classification algorithms face huge challenges when confronted with large-scale data. Current research on classification models and algorithms focuses mainly on improving classification accuracy and reducing time and space complexity. With big data, however, the main problems are storing and computing over huge amounts of data, and traditional processing methods cannot meet the need in either respect. Studying how to classify huge amounts of data quickly and efficiently is therefore of great significance. Working on the Hadoop distributed computing platform, this thesis addresses the defects in the weight calculation of the traditional naive Bayes algorithm by proposing an improved weighted naive Bayes algorithm, and applies it to the statistics of Web users' access preferences. The thesis first introduces the research background, significance, and current state of research. It then introduces the related technologies involved in the project: preprocessing, model representation, keyword selection, and feature weight computation in document categorization; Bayesian theory and the naive Bayes classification algorithm; and the relevant technology of the Hadoop distributed computing platform, mainly Hadoop distributed storage and MapReduce distributed computing. Next, it presents a word segmentation algorithm for both English and Chinese on the Hadoop platform, which uses Lucene in the segmentation process and handles ambiguity through statistics. Because the Hadoop platform processes small files slowly, the thesis implements a custom input format that merges several small text files into one large file; experiments show that this custom input format handles small-file input very well. To address the flaws in the weight calculation of the traditional naive Bayes classification algorithm, the thesis proposes an improved weighted naive Bayes classification algorithm and implements it through five MapReduce processes. Experiments on the Hadoop platform using 8,237 data sets show that the improved weighted naive Bayes classification algorithm performs very well on macro-averaged F1 (MacAvg_F1) and micro-averaged F1 (MicAvg_F1). Finally, using the parallel word segmentation technique and the improved naive Bayes classification algorithm, the thesis classifies the content of pages that Web users access and analyzes their preferences statistically with Pig, which has some commercial value for operators engaged in precision marketing.
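
The macro-averaged and micro-averaged F1 scores used above as evaluation metrics can be illustrated with a small sketch. This is not the thesis's implementation, which the abstract does not show; it is a minimal single-machine Python example of the standard definitions. The function names `f1` and `macro_micro_f1` and the category labels in the usage note are hypothetical.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_micro_f1(y_true, y_pred):
    """Return (macro-averaged F1, micro-averaged F1) for single-label predictions.

    Macro averaging computes F1 per class and takes the unweighted mean,
    so rare classes count as much as frequent ones. Micro averaging pools
    the counts of all classes first and computes one F1 from the totals.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro
```

As a usage illustration with made-up labels, `macro_micro_f1(["sports", "sports", "news", "news", "tech"], ["sports", "news", "news", "news", "tech"])` pools four correct predictions out of five for the micro average, while the macro average is the mean of the three per-class F1 values. Reporting both shows whether a classifier does well only on frequent categories or across all of them.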
Keywords/Search Tags: Hadoop, Naive Bayes algorithm, Lucene, Ambiguity processing