SPARK-based User Feature Analysis

Posted on:2018-07-30

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Zhang

GTID:2358330515499073

Subject:Computer technology

Abstract/Summary:

In recent years, the rapid development of the Internet has provided a rich and convenient network environment, and people increasingly use it for communication, exchange, and entertainment. As user data accumulates on the Internet, more and more people recognize the value hidden in big data, and a worldwide wave of big data research has followed. The popularity of big data technology has drawn many scholars at home and abroad into big data mining and has led to research systems for data mining based on users' network behavior. A big data computing platform does not require ultra-high-performance servers: it can be built from ordinary PCs, and such a cluster can deliver better computing performance than an ultra-high-performance server. Distributed computing platforms, represented by Spark, are a new technology that has emerged and developed rapidly in recent years, because their memory-based computing model can provide mass storage and very high computing capability. Solving big data mining and analysis tasks with a cloud computing scheme can greatly improve computing speed and the efficiency of user classification. Combining a distributed computing platform such as Spark with the classification and mining of massive user data sets is therefore a research direction with great scientific value and application potential.

This paper studies user characteristics analysis based on Spark and an improved TF-IDF algorithm. The specific work is as follows:

1. The paper studies Spark-related technology and the construction process of a Spark cluster. Applying the Naive Bayes classification algorithm within Spark's in-memory computing framework, it analyzes information about users' video watching and builds a classification model for gender and age range. The architecture of the whole analysis system is also introduced.
2. The basic classification algorithm does not consider the weight of characteristic items, so it cannot reflect the value of each item. The traditional TF-IDF weight is therefore used in further experiments, and its classification effect is compared with that of the basic classification algorithm.
3. The defect of the traditional TF-IDF weighting method is identified: it considers only the value of the characteristic item itself and cannot reflect the correlation between characteristic items and categories. To solve this problem, the paper proposes a TFC-IDFC weight computing algorithm based on the correlation between characteristic items and categories. The process of optimizing the classification model is described in detail, and the classification results are obtained through experiments.
4. Comparison of the accuracy rate and the F1 value against the basic classification algorithm and the traditional TF-IDF weight computing algorithm shows that the TFC-IDFC weight computing algorithm gives the model better classification ability.
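To make the traditional TF-IDF weighting and the Naive Bayes classification mentioned in the abstract concrete, here is a minimal single-machine sketch in plain Python. It is an illustration under stated assumptions, not the thesis's implementation: it does not use Spark, the video-category tokens and gender labels are invented toy data, and all function names (`fit_idf`, `tf_idf`, `train_nb`, `predict`) are hypothetical. The IDF form ln(N/df) and the Laplace-smoothed multinomial model are common textbook choices. The TFC-IDFC weight is not reproduced, because the abstract does not give its formula.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Return {term: ln(N / df)} for a list of token lists."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tf_idf(doc, idf):
    """Weight one token list; terms unseen during fit_idf are dropped."""
    counts = Counter(term for term in doc if term in idf)
    total = sum(counts.values())
    return {term: (count / total) * idf[term] for term, count in counts.items()}


def train_nb(features, labels, alpha=1.0):
    """Multinomial Naive Bayes on weighted counts with Laplace smoothing."""
    vocab = {term for feats in features for term in feats}
    weight_sums = defaultdict(lambda: defaultdict(float))
    for feats, label in zip(features, labels):
        for term, weight in feats.items():
            weight_sums[label][term] += weight
    model = {}
    for label, n_label in Counter(labels).items():
        denom = sum(weight_sums[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_label / len(labels))
        log_lik = {
            term: math.log((weight_sums[label].get(term, 0.0) + alpha) / denom)
            for term in vocab
        }
        model[label] = (log_prior, log_lik)
    return model


def predict(model, feats):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(
            weight * log_lik[term]
            for term, weight in feats.items()
            if term in log_lik
        )
    return max(model, key=score)


# Toy data (assumed): categories of videos each user watched, plus a label.
train_docs = [
    ["sports", "sports", "news"],
    ["sports", "games", "news"],
    ["drama", "fashion", "news"],
    ["drama", "drama", "fashion"],
]
train_labels = ["male", "male", "female", "female"]

idf = fit_idf(train_docs)
model = train_nb([tf_idf(doc, idf) for doc in train_docs], train_labels)

pred_a = predict(model, tf_idf(["sports", "games"], idf))
pred_b = predict(model, tf_idf(["drama", "fashion", "fashion"], idf))
print(pred_a, pred_b)
```

In this sketch a term such as "news", which appears in most users' histories, receives a lower IDF than a rarer term such as "games", so it contributes less to the class scores. The weight still depends only on the term's own frequency and not on how it is distributed across categories, which is the limitation the abstract attributes to traditional TF-IDF.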

Keywords/Search Tags:

Spark, user characteristics, Bayes, classification, TF-IDF

Related items

1	Parallel Bayesian Spam Classification System Based On Spark
2	Large Data Analysis Of User Behavior Based On Web Log
3	The Research Of User Classification Algorithm Based On The Regularized-naive Bayes
4	Research And Implementation On Feature Extraction And Classification Of Chinese Text Based On SPARK
5	Video User Classification Based On Improved TF-IDF Algorithm And Preference
6	Research On Chinese Text Feature Classification Based On Distributed Framework
7	Research On Large-scale Traffic Classification Technology Based On Spark Performance Optimization
8	Design And Implementation Of A Hibernal Tree Automatic Classification System Based On Bayes
9	Design Of User Portrait System Based On Big Data
10	Research On Business Email Classification Based On User Knowledge