News Review Topic Mining Based On Clustering And LDA

Posted on: 2017-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: B G Li
Full Text: PDF
GTID: 2308330503459636
Subject: Management Science and Engineering
Abstract/Summary:
News commentaries reflect public views on news events, so extracting the topics of reviews has high value for intelligence analysis. Two requirements arise in practice. First, we need to obtain the news data. Second, given a series of news items, we want to extract the topics of the news and its comments, which helps people understand attitudes toward the news. We also want to know how a topic changes over time: when it starts, strengthens, weakens, ends, or mutates into other topics.
To meet the first requirement, this paper presents a Python-based dynamic web crawler algorithm for capturing dynamically loaded review pages. The algorithm uses information from static pages to construct dynamic links, and a comment collector is implemented on this basis. A comparison with a general crawler algorithm indicates that the proposed algorithm is well targeted, acquires data quickly, is easy to embed, and is simple. It gives researchers who are not proficient in programming fast access to large data sources.
For the second requirement, this paper presents an improved LDA-based algorithm that addresses shortcomings of the existing LDA algorithm in topic mining of news reviews. The improved model puts comments from the same period of time into one text block, and each text block can then be simplified. These fine features help managers and policy makers use the intelligence in comments to make decisions.
However, the topics produced by the LDA-based algorithm are not easy to interpret, so this paper also presents an improved K-means algorithm that addresses shortcomings of the existing K-means algorithm on comment data. When K-means clustering with Euclidean distance is applied to topic mining of news comments, selecting the initial centers by the maximum distance method gives poor clustering performance.
To solve this problem, synonym substitution and a domain dictionary are first introduced in the preprocessing stage to address data sparseness and high dimensionality. Second, an improved K-means algorithm is proposed. It selects the initial cluster centers by maximum distance after the long comments are hidden, which solves the problem of initial centers being outliers. A variance inflection method is proposed to address the requirement of traditional K-means that the value of k be supplied as input. The new algorithm is found to have good clustering performance by BW-DF after BW is used to select initial centers. Finally, the improved clustering algorithm is compared with the original one. The results show that the improved algorithm extracts opinion topics effectively and with high accuracy.
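The two K-means changes described above can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not define how long comments are hidden or how the variance inflection is computed. The sketch assumes a simple length cutoff (max_len) and takes the inflection to be the k with the largest second difference of the within-cluster sum of squared errors. All function and parameter names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def max_distance_init(vectors, lengths, k, max_len):
    """Pick k initial centers by maximum distance, ignoring long comments."""
    # Assumption: "hiding" long comments means excluding any comment whose
    # length exceeds max_len from the pool of candidate centers.
    pool = [v for v, n in zip(vectors, lengths) if n <= max_len]
    if len(pool) < k:
        raise ValueError("not enough short comments to choose k centers")
    centers = [pool[0]]  # arbitrary but deterministic starting point
    while len(centers) < k:
        # Next center: the candidate farthest from its nearest chosen center.
        centers.append(
            max(pool, key=lambda v: min(sq_dist(v, c) for c in centers))
        )
    return centers


def assign(vectors, centers):
    """Index of the nearest center for every vector."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(v, centers[j]))
        for v in vectors
    ]


def kmeans(vectors, centers, iters=50):
    """Plain Lloyd iterations from the given centers; returns (labels, sse)."""
    centers = [list(c) for c in centers]
    for _ in range(iters):
        labels = assign(vectors, centers)
        for j in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign(vectors, centers)
    sse = sum(sq_dist(v, centers[lab]) for v, lab in zip(vectors, labels))
    return labels, sse


def pick_k_by_inflection(sse_by_k):
    """Return the k where the SSE curve bends most (largest second difference)."""
    ks = sorted(sse_by_k)
    if len(ks) < 3:
        raise ValueError("need SSE values for at least three values of k")

    def bend(i):
        drop_before = sse_by_k[ks[i - 1]] - sse_by_k[ks[i]]
        drop_after = sse_by_k[ks[i]] - sse_by_k[ks[i + 1]]
        return drop_before - drop_after

    return ks[max(range(1, len(ks) - 1), key=bend)]
```

In use, one would vectorize the preprocessed comments, run max_distance_init and kmeans for a range of k values, collect the resulting errors in a dictionary keyed by k, and pass that dictionary to pick_k_by_inflection.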
Keywords/Search Tags: Python language, Dynamic web review crawler algorithm, Improved k-means clustering algorithm, Improved model based on LDA, Topic mining