The Research And Application Of Parallel Latent Dirichlet Allocation And Clustering Algorithm

Posted on: 2017-10-31 | Degree: Master | Type: Thesis
Country: China | Candidate: Q Y Wan | Full Text: PDF
GTID: 2348330488977974 | Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, society has entered an era of data explosion. Because this data contains useful information, mining valuable information from big data has become a research hotspot. Traditional stand-alone data processing cannot cope with this flood of data, so people began to seek new solutions, and cloud computing and big data processing technologies came into being. Among these technologies, Spark, an in-memory computing framework for large-scale data processing, has become popular in recent years. It is widely used because it is well suited to interactive and iterative computation.

In this thesis, we design parallel machine learning methods based on Spark. The thesis also covers word similarity computation, for which we improve the calculation method. Finally, these methods are applied to micro-blog ad serving to achieve targeted ad delivery.

The research is divided into the following four areas:

1. Parallel LDA topic modeling on Spark. We propose a parallel LDA topic modeling method based on Spark and use Gibbs sampling to train the model. We divide the data set into several subsets and assign each subset to a node for parallel processing, which yields a parallel LDA model.

2. Improved and parallelized Bisecting K-Means clustering. The way Bisecting K-Means selects its initial centroids limits clustering speed. We improve the algorithm by choosing the two patterns separated by the maximum distance as the initial cluster centroids, which accelerates clustering. In addition, we propose a parallel version of the method based on Spark.

3. Improved word similarity computation. The similarity computation is based on HowNet. Through a study of HowNet, we improve the word similarity computing method, and experiments show that the resulting similarity values agree with human understanding and intuition.

4. Targeted micro-blog advertising. Using the results above, we design a scheme for targeted delivery of micro-blog ads. First, we use the parallel LDA algorithm and the parallel Bisecting K-Means algorithm to mine users' interests. Second, we use the word similarity method to compute the similarity between interest words and ad keywords, so that ads can be delivered to interested users.
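
The centroid initialization described in area 2 can be illustrated with a minimal single-machine sketch. It is not the thesis implementation: the abstract does not state the distance metric, the rule for choosing which cluster to split, or how the work is distributed on Spark. The sketch assumes Euclidean distance, splits the cluster with the largest sum of squared errors, and uses hypothetical function names (farthest_pair, bisect, bisecting_kmeans).

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def farthest_pair(points):
    """Return the two points with the maximum pairwise distance (O(n^2) scan)."""
    return max(combinations(points, 2), key=lambda pair: dist(pair[0], pair[1]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(dist(p, center) ** 2 for p in points)


def bisect(points, max_iter=20):
    """Split one cluster in two with 2-means, seeded by the farthest pair.

    Assumption: the cluster holds at least two points.
    """
    c0, c1 = farthest_pair(points)
    left, right = list(points), []
    for _ in range(max_iter):
        # Ties go to the left cluster, so left is never empty.
        left = [p for p in points if dist(p, c0) <= dist(p, c1)]
        right = [p for p in points if dist(p, c0) > dist(p, c1)]
        if not right:  # all points coincide, nothing to split
            break
        new_c0, new_c1 = mean(left), mean(right)
        if (new_c0, new_c1) == (c0, c1):  # centroids stopped moving
            break
        c0, c1 = new_c0, new_c1
    return left, right


def bisecting_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist.

    Assumption: the input contains at least k distinct points.
    """
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        clusters.remove(worst)
        clusters.extend(bisect(worst))
    return clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for cluster in bisecting_kmeans(data, 2):
        print(sorted(cluster))
```

Seeding with the farthest pair removes the random restarts that a randomly initialized split usually needs, at the price of a quadratic scan over the cluster being split. In a Spark setting that scan would itself have to be distributed or approximated, and the abstract does not say how this is done.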
Keywords/Search Tags: Spark, LDA, Bisecting K-Means, Word similarity, micro-blog ads